---
layout: layout.njk
permalink: "{{ page.filePathStem }}.html"
title: Smile - Data Processing
---
{% include "toc.njk" %}

<div class="col-md-9 col-md-pull-3">
    <h1 id="data-top" class="title">Data Processing</h1>

    <p>Machine learning is all about building models from data. However, data scientists
        often start with models and algorithms, which is likely to produce
        suboptimal results. The alternative is to explore the data first. Even simple
        statistics and plots help us get a feel for the data and the problem, which is
        more likely to lead to better models.</p>

    <h2 id="features">Features</h2>

    <p>A feature is an individual measurable property of a phenomenon being observed.
        Features are also called explanatory variables, independent variables, predictors, regressors, etc.
        Any attribute could be a feature, but choosing informative, discriminating and
        independent features is a crucial step for effective algorithms in machine learning.
        Features are usually numeric and a set of numeric features can be conveniently
        described by a feature vector. Structural features such as strings, sequences and
        graphs are also used in areas such as natural language processing, computational biology, etc.</p>

    <p>Feature engineering is the process of using domain knowledge of the data to create features that make
        machine learning algorithms work. Feature engineering is fundamental to the application of machine
        learning, and is both difficult and expensive. It requires experimenting with multiple
        possibilities and combining automated
        techniques with the intuition and knowledge of domain experts.</p>

    <h2 title="data-type">Data Type</h2>
    <p>Generally speaking, there are two major types of attributes:</p>
        <dl>
            <dt>Qualitative variables:</dt>
            <dd><p>The data values are non-numeric categories. Examples: Blood type, Gender.</p></dd>
            <dt>Quantitative variables:</dt>
            <dd><p>The data values are counts or numerical measurements. A quantitative
                variable can be either discrete such as the number of students receiving
                an 'A' in a class, or continuous such as GPA, salary and so on.</p></dd>
        </dl>

    <p>Another way of classifying data is by the measurement scales. In statistics,
        there are four generally used measurement scales:</p>

        <dl>
            <dt>Nominal data:</dt>
            <dd><p>Data values are non-numeric group labels. For example, a Gender variable
                can be encoded as male = 0 and female = 1.</p></dd>
            <dt>Ordinal data:</dt>
            <dd><p>Data values are categorical and may be ranked in some numerically
                meaningful way. For example, strongly disagree to strongly agree may be
                defined as 1 to 5.</p></dd>
            <dt>Continuous data:</dt>
            <dd><ul>
                <li><strong>Interval data:</strong>
                Data values range over a real interval, which can be as large as
                from negative infinity to positive infinity. The difference between two
                values is meaningful, but the ratio of two values is not.
                Examples: temperature in degrees Celsius, IQ.</li>
                <li><strong>Ratio data:</strong>
                Both difference and ratio of two values are meaningful. For example,
                salary, weight.</li>
            </ul></dd>
        </dl>

    <p>Many machine learning algorithms can only handle numeric attributes, while a few,
        such as decision trees, can process nominal attributes directly. Date attributes
        are useful in plotting. With some feature engineering, values such as the day of
        the week can be used as nominal attributes. String attributes can be used in text mining
        and natural language processing.</p>
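    <p>As a rough illustration of the day-of-week idea, the sketch below derives an integer
        code from a date using only <code>java.time</code>. The class and method names
        (<code>DayOfWeekFeature</code>, <code>dayOfWeekCode</code>) are made up for this example
        and are not part of Smile; the zero-based, Monday-first coding is just one possible
        convention.</p>

```java
import java.time.DayOfWeek;
import java.time.LocalDate;

// Hypothetical helper for illustration only; not a Smile class.
public class DayOfWeekFeature {
    /** Maps a date to an integer code in 0..6 (Monday = 0, Sunday = 6). */
    public static int dayOfWeekCode(LocalDate date) {
        DayOfWeek day = date.getDayOfWeek();
        // DayOfWeek.getValue() follows ISO-8601: 1 (Monday) to 7 (Sunday).
        return day.getValue() - 1;
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2025, 2, 1);
        // prints "SATURDAY -> 5"
        System.out.println(date.getDayOfWeek() + " -> " + dayOfWeekCode(date));
    }
}
```

    <p>The resulting code is a label rather than a quantity: Saturday (5) is not "more" than
        Monday (0), so the value should be treated as a nominal attribute.</p>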

    <h2 id="DataFrame">DataFrame</h2>

    <p>While some Smile algorithms take simple <code>double[]</code> as input, we often use
        the encapsulation class <a href="api/java/smile/data/DataFrame.html">DataFrame</a>.
        DataFrame is a two-dimensional data structure like a table with rows
        and columns. Each column is a <a href="api/java/smile/data/vector/ValueVector.html">ValueVector</a>
        that is a one-dimensional labeled abstraction to store a sequence of values having the same type.
        Columns in a DataFrame may have different data types. There are concrete subclasses of ValueVector
        for each primitive data type and generic object types. We can create a ValueVector by passing a list of values.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_0" data-toggle="tab">Java</a></li>
        <li><a href="#scala_0" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_0" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_0">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> import java.time._
           import smile.data._
           import smile.data.vector._

    smile> ValueVector.of("A", 1.0, 2.0, 3.0)
    val res0: smile.data.vector.DoubleVector = A[1, 2, 3]

    smile> ValueVector.of("B", Instant.now())
    val res1: smile.data.vector.ObjectVector[java.time.Instant] = B[2025-01-19T01:58:06.813727400Z]

    smile> ValueVector.nominal("C", "test", "train", "test", "train")
    val res2: smile.data.vector.ValueVector = C[test, train, test, train]

    smile> ValueVector.of("D",
               "this is a string vector",
               "Nominal/ordinal vectors store data as integers internally")
    val res3: smile.data.vector.StringVector = D[this is a string vector, Nominal/ordinal vectors store data as integers internally]

    smile> ObjectVector.of("E", Index.range(0, 4).toArray(), Array(3, 3, 3, 3))
    val res4: smile.data.vector.ObjectVector[Array[Int]] = E[[0, 1, 2, 3], [3, 3, 3, 3]]
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_0">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> import java.time.*
           import smile.data.*
           import smile.data.vector.*

    smile> ValueVector.of("A", 1.0, 2.0, 3.0)
    $2 ==> A[1, 2, 3]

    smile> ValueVector.of("B", Instant.now())
    $3 ==> B[2025-01-18T23:08:54.375073600Z]

    smile> ValueVector.nominal("C", "test", "train", "test", "train")
    $4 ==> C[test, train, test, train]

    smile> ValueVector.of("D",
               "this is a string vector",
               "Nominal/ordinal vectors store data as integers internally")
    $5 ==> D[this is a string vector, Nominal/ordinal vectors store data as integers internally]

    smile> ObjectVector.of("E", Index.range(0, 4).toArray(), new int[]{3, 3, 3, 3})
    $6 ==> E[[0, 1, 2, 3], [3, 3, 3, 3]]
    </code></pre>
            </div>
        </div>
        <div class="tab-pane" id="kotlin_0">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> import java.time.*
        import smile.data.*
        import smile.data.vector.*
        import smile.util.*

    >>> ValueVector.of("A", 1.0, 2.0, 3.0)
    res4: smile.data.vector.DoubleVector = A[1, 2, 3]

    >>> ValueVector.of("B", Instant.now())
    res5: smile.data.vector.ObjectVector&lt;java.time.Instant?&gt; = B[2025-01-19T01:58:06.813727400Z]

    >>> ValueVector.nominal("C", "test", "train", "test", "train")
    res6: smile.data.vector.ValueVector = C[test, train, test, train]

    >>> ValueVector.of("D",
               "this is a string vector",
               "Nominal/ordinal vectors store data as integers internally")
    res7: smile.data.vector.StringVector = D[this is a string vector, Nominal/ordinal vectors store data as integers internally]

    >>> ObjectVector.of("E", Index.range(0, 4).toArray(), intArrayOf(3, 3, 3, 3))
    res8: smile.data.vector.ObjectVector&lt;kotlin.IntArray?&gt; = E[[0, 1, 2, 3], [3, 3, 3, 3]]
    </code></pre>
            </div>
        </div>
    </div>

    <p>Note that the <code>nominal</code> and <code>ordinal</code> methods
        factorize string values and store them as integral values internally,
        which is more efficient, more compact, and friendlier to machine learning
        algorithms. In contrast, the <code>ValueVector.of(String...)</code>
        method returns a <code>StringVector</code> that stores string values
        as is, which is useful for text processing.</p>

    <h3 id="DataFrame_Creation">Creation</h3>
    <p>For illustration, the example below creates a DataFrame from a 2-dimensional array.
        If no column names are passed, the default column names are
        V1, V2, etc. It is also easy to create a DataFrame by passing a list of
        columns. The columns of the second DataFrame have different data types.
        The method <code>schema()</code> describes the column names, the
        data types, and whether the columns can be null.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_0.1" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_0.1">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var df = DataFrame.of(MathEx.randn(6, 4))
    df ==>
    +---+---------+---------+---------+---------+
    |   |       V1|       V2|       V3|       V4|
    +---+---------+---------+---------+---------+
    |  0|-0.201469| 0.970363| 2.726932|-0.146014|
    |  1| 1.872161| 0.495932| 0.553859|-0.028237|
    |  2|-0.504866|-0.179409| 0.201377| 0.281267|
    |  3| 0.894446| 0.791521| 0.053346| 0.213519|
    |  4| 0.200011|-0.203736|-0.349196|-1.193759|
    |  5|  1.52529|-1.407597|  1.16758| -1.78291|
    +---+---------+---------+---------+---------+

    smile> var df = new DataFrame(
                   ValueVector.of("A", 1.0),
                   ValueVector.of("B", LocalDate.parse("2013-01-02")),
                   ValueVector.of("C", "foo"),
                   ObjectVector.of("D", Index.range(0, 4).toArray()),
                   ObjectVector.of("E", new int[]{3, 3, 3, 3})
           )
    df ==>
    +---+---+----------+---+------------+------------+
    |   |  A|         B|  C|           D|           E|
    +---+---+----------+---+------------+------------+
    |  0|  1|2013-01-02|foo|[0, 1, 2, 3]|[3, 3, 3, 3]|
    +---+---+----------+---+------------+------------+

    smile> df.schema()
    $4 ==> {
      A: double NOT NULL,
      B: Date,
      C: String,
      D: int[],
      E: int[]
    }
    </code></pre>
            </div>
        </div>
    </div>

    <p>We can create a DataFrame with a collection of records or beans.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_0.2" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_0.2">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
    smile> enum Gender {Male, Female}
           record Person(String name, Gender gender, String state, LocalDate birthday, int age, Double salary) { }
           List&lt;Person&gt; persons = new ArrayList&lt;&gt;();
           persons.add(new Person("Alex", Gender.Male, "NY", LocalDate.of(1980, 10, 1), 38, 10000.));
           persons.add(new Person("Bob", Gender.Male, "AZ", LocalDate.of(1995, 3, 4), 23, null));
           persons.add(new Person("Jane", Gender.Female, "CA", LocalDate.of(1970, 3, 1), 48, 230000.));
           persons.add(new Person("Amy", Gender.Female, "NY", LocalDate.of(2005, 12, 10), 13, null));
           var df = DataFrame.of(Person.class, persons);
    df ==>
    +---+----+------+-----+----------+---+------+
    |   |name|gender|state|  birthday|age|salary|
    +---+----+------+-----+----------+---+------+
    |  0|Alex|  Male|   NY|1980-10-01| 38| 10000|
    |  1| Bob|  Male|   AZ|1995-03-04| 23|  null|
    |  2|Jane|Female|   CA|1970-03-01| 48|230000|
    |  3| Amy|Female|   NY|2005-12-10| 13|  null|
    +---+----+------+-----+----------+---+------+
    </code></pre>
            </div>
        </div>
    </div>

    <p>In the example above, the column 'state' is of string type, but it is clearly a categorical variable.
        We can factorize such string columns to categorical values:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_0.3" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_0.3">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var df2 = df.factorize("state");
    df2 ==>
    +---+----+------+-----+----------+---+------+
    |   |name|gender|state|  birthday|age|salary|
    +---+----+------+-----+----------+---+------+
    |  0|Alex|  Male|   NY|1980-10-01| 38| 10000|
    |  1| Bob|  Male|   AZ|1995-03-04| 23|  null|
    |  2|Jane|Female|   CA|1970-03-01| 48|230000|
    |  3| Amy|Female|   NY|2005-12-10| 13|  null|
    +---+----+------+-----+----------+---+------+

    smile> df.get(0, 2)
    $25 ==> "NY"
    smile> df2.get(0, 2)
    $27 ==> 2
    </code></pre>
            </div>
        </div>
    </div>
    <p>On the surface, everything looks the same. But the column 'state' has actually been
    converted to integral values under the hood.</p>
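    <p>To make the idea of factorization concrete, here is a minimal plain-Java sketch, not
        Smile's actual implementation. It assumes the distinct strings are sorted and that each
        value is replaced by the position of its level; that ordering is an assumption, although
        it is consistent with "NY" being stored as 2 in the output above. The class
        <code>FactorizeSketch</code> and its methods are hypothetical names.</p>

```java
import java.util.Arrays;

// Hypothetical illustration of factorizing strings; not part of Smile.
public class FactorizeSketch {
    /** Returns the distinct values in sorted order; a value's position is its integer code. */
    public static String[] levels(String[] values) {
        return Arrays.stream(values).distinct().sorted().toArray(String[]::new);
    }

    /** Replaces each string with the index of its level in the sorted level array. */
    public static int[] encode(String[] values, String[] levels) {
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // levels is sorted, so binary search finds the code of each value
            codes[i] = Arrays.binarySearch(levels, values[i]);
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] state = {"NY", "AZ", "CA", "NY"};
        String[] levels = levels(state);
        int[] codes = encode(state, levels);
        System.out.println(Arrays.toString(levels)); // prints [AZ, CA, NY]
        System.out.println(Arrays.toString(codes));  // prints [2, 0, 1, 2]
    }
}
```

    <p>Storing the small integer codes plus one copy of each level is what makes the factorized
        column more compact than a column of repeated strings.</p>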

    <p>Smile provides many parsers for popular data formats. In fact, the output of most Smile
        data parsers is a DataFrame.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_1" data-toggle="tab">Java</a></li>
        <li><a href="#scala_1" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_1" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_1">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> val iris = read.arff("data/weka/iris.arff")
    [main] INFO smile.io.Arff - Read ARFF relation iris
    iris: DataFrame =
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |  1|        4.9|         3|        1.4|       0.2|Iris-setosa|
    |  2|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |  3|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |  4|          5|       3.6|        1.4|       0.2|Iris-setosa|
    |  5|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |  6|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |  7|          5|       3.4|        1.5|       0.2|Iris-setosa|
    |  8|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |  9|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
    140 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_1">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> import smile.data.*
           import smile.datasets.*
           import smile.io.*
           var iris = Read.arff("data/weka/iris.arff")
    [main] INFO smile.io.Arff - Read ARFF relation iris
    $3 ==>
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |  1|        4.9|         3|        1.4|       0.2|Iris-setosa|
    |  2|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |  3|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |  4|          5|       3.6|        1.4|       0.2|Iris-setosa|
    |  5|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |  6|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |  7|          5|       3.4|        1.5|       0.2|Iris-setosa|
    |  8|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |  9|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
    140 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane" id="kotlin_1">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> import smile.*
        import smile.data.*
        import smile.io.*
        val iris = Read.arff("data/weka/iris.arff")
    [main] INFO smile.io.Arff - Read ARFF relation iris
    >>> iris
    res3: smile.data.DataFrame! =
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |  1|        4.9|         3|        1.4|       0.2|Iris-setosa|
    |  2|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |  3|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |  4|          5|       3.6|        1.4|       0.2|Iris-setosa|
    |  5|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |  6|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |  7|          5|       3.4|        1.5|       0.2|Iris-setosa|
    |  8|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |  9|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
    140 more rows...
    </code></pre>
            </div>
        </div>
    </div>

    <p>The <code>smile.datasets</code> package also provides many public datasets.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_1.1" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_1.1">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>

    smile> var iris = new Iris().data() // use built-in dataset object
    [main] INFO smile.io.Arff - Read ARFF relation iris
    iris ==>
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |  1|        4.9|         3|        1.4|       0.2|Iris-setosa|
    |  2|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |  3|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |  4|          5|       3.6|        1.4|       0.2|Iris-setosa|
    |  5|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |  6|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |  7|          5|       3.4|        1.5|       0.2|Iris-setosa|
    |  8|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |  9|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
    140 more rows...
    </code></pre>
            </div>
        </div>
    </div>

    <h3 id="DataFrame_Index">Indexing</h3>
    <p>We can set a row index on a data frame. The index must have as many elements
        as the data frame has rows, with no duplicates or missing values. The row index
        serves as a unique identifier for each row in a DataFrame and allows you to access
        rows efficiently with the <code>loc()</code> method. Compared with the default
        ordinal index, an object-based row index often carries semantic
        meaning.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.1" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.1">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var df = DataFrame.of(MathEx.randn(6, 4))
           var dates = Dates.range(LocalDate.of(2025,2,1), 6)
           df = df.setIndex(dates)
    df ==>
    +----------+---------+---------+---------+---------+
    |          |       V1|       V2|       V3|       V4|
    +----------+---------+---------+---------+---------+
    |2025-02-01|-0.189794| 0.609897|-0.289189|-0.636956|
    |2025-02-02|-0.133296|-2.461161|  0.25011| 1.132062|
    |2025-02-03|  0.25248|-0.063054|-1.128157|  0.37634|
    |2025-02-04|-1.391481| 1.398828| 0.294973| 1.353308|
    |2025-02-05| 2.812277|  0.82762|-0.294806| 1.836631|
    |2025-02-06| 1.091213|-0.190432| 1.963064| 0.725228|
    +----------+---------+---------+---------+---------+
    smile> df.loc(dates[1])
    $8 ==> {
      V1: -0.133296,
      V2: -2.461161,
      V3: 0.25011,
      V4: 1.132062
    }

    smile> df.loc(dates[1], dates[2])
    $9 ==>
    +----------+---------+---------+---------+--------+
    |          |       V1|       V2|       V3|      V4|
    +----------+---------+---------+---------+--------+
    |2025-02-02|-0.133296|-2.461161|  0.25011|1.132062|
    |2025-02-03|  0.25248|-0.063054|-1.128157| 0.37634|
    +----------+---------+---------+---------+--------+
    </code></pre>
            </div>
        </div>
    </div>
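    <p>The constraints on a row index (same length as the number of rows, no duplicates, no
        missing values) and the efficiency of label lookup can be sketched with a hash map from
        label to row position. This is only an illustration under those assumptions, not how Smile
        implements its index; <code>RowIndexSketch</code> and its <code>loc</code> method are
        hypothetical names that merely mirror the idea of <code>loc()</code>.</p>

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical illustration of a label-based row index; not part of Smile.
public class RowIndexSketch {
    // Maps each row label to the ordinal position of its row.
    private final Map<Object, Integer> position = new HashMap<>();

    public RowIndexSketch(Object[] labels, int nrow) {
        if (labels.length != nrow) {
            throw new IllegalArgumentException("Index length must equal the number of rows");
        }
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == null) {
                throw new IllegalArgumentException("Missing label at row " + i);
            }
            // putIfAbsent returns the previous position if the label was already seen
            if (position.putIfAbsent(labels[i], i) != null) {
                throw new IllegalArgumentException("Duplicate label: " + labels[i]);
            }
        }
    }

    /** Returns the ordinal position of the row with the given label. */
    public int loc(Object label) {
        Integer i = position.get(label);
        if (i == null) {
            throw new NoSuchElementException("Unknown label: " + label);
        }
        return i;
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2025, 2, 1);
        Object[] dates = new Object[6];
        for (int i = 0; i < dates.length; i++) {
            dates[i] = start.plusDays(i);
        }
        RowIndexSketch index = new RowIndexSketch(dates, 6);
        System.out.println(index.loc(LocalDate.of(2025, 2, 2))); // prints 1
    }
}
```

    <p>Because the labels are unique, each lookup resolves to exactly one row in expected
        constant time, which is why duplicates and missing values are rejected up front.</p>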

    <p>The row index may be an existing column, which will be removed in the
        resulting data frame.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.2" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.2">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var df = new DataFrame(
                   ValueVector.of("A", 1.0),
                   ValueVector.of("B", LocalDate.parse("2013-01-02")),
                   ValueVector.of("C", "foo"),
                   ObjectVector.of("D", Index.range(0, 4).toArray()),
                   ObjectVector.of("E", new int[]{3, 3, 3, 3})
           )
           df = df.setIndex("B")
    df ==>
    +----------+---+---+------------+------------+
    |          |  A|  C|           D|           E|
    +----------+---+---+------------+------------+
    |2013-01-02|  1|foo|[0, 1, 2, 3]|[3, 3, 3, 3]|
    +----------+---+---+------------+------------+
    </code></pre>
            </div>
        </div>
    </div>

    <h3 id="DataFrame_Viewing">Viewing Data</h3>
    <p>By default, <code>DataFrame.toString()</code> returns a pretty-printed view of
        the first 10 rows. You may also use <code>DataFrame.head()</code> and
        <code>DataFrame.tail()</code> with a specified number of rows to view the top
        and bottom of the frame, respectively.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_0.4" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_0.4">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> println(df.head(3))
    +----------+---------+---------+---------+---------+
    |          |       V1|       V2|       V3|       V4|
    +----------+---------+---------+---------+---------+
    |2025-02-01|-0.189794| 0.609897|-0.289189|-0.636956|
    |2025-02-02|-0.133296|-2.461161|  0.25011| 1.132062|
    |2025-02-03|  0.25248|-0.063054|-1.128157|  0.37634|
    +----------+---------+---------+---------+---------+
    3 more rows...

    smile> println(df.tail(3))
    +----------+---------+---------+---------+--------+
    |          |       V1|       V2|       V3|      V4|
    +----------+---------+---------+---------+--------+
    |2025-02-04|-1.391481| 1.398828| 0.294973|1.353308|
    |2025-02-05| 2.812277|  0.82762|-0.294806|1.836631|
    |2025-02-06| 1.091213|-0.190432| 1.963064|0.725228|
    +----------+---------+---------+---------+--------+
    </code></pre>
            </div>
        </div>
    </div>

    <p>The method <code>describe()</code> shows the data structure and a statistical summary:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_0.5" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_0.5">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> iris.describe()
    +---+-----------+-----+--------------------+-----+--------+--------+---+---+---+---+---+
    |   |     column| type|             measure|count|    mean|     std|min|25%|50%|75%|max|
    +---+-----------+-----+--------------------+-----+--------+--------+---+---+---+---+---+
    |  0|sepallength|float|                null|  150|5.843333|0.828066|4.3|5.1|5.8|6.4|7.9|
    |  1| sepalwidth|float|                null|  150|   3.054|0.433594|  2|2.8|  3|3.3|4.4|
    |  2|petallength|float|                null|  150|3.758667| 1.76442|  1|1.6|4.4|5.1|6.9|
    |  3| petalwidth|float|                null|  150|1.198667|0.763161|0.1|0.3|1.3|1.8|2.5|
    |  4|      class| byte|nominal[Iris-seto...|  150|     NaN|     NaN|  0|  0|  1|  2|  2|
    +---+-----------+-----+--------------------+-----+--------+--------+---+---+---+---+---+
    </code></pre>
            </div>
        </div>
    </div>
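    <p>For readers who want to see what some of the summary columns mean, the sketch below computes
        the count, mean, standard deviation, minimum and maximum of a numeric column in plain Java.
        It is not Smile's implementation: <code>DescribeSketch</code> is a hypothetical class, and
        the use of the sample standard deviation (dividing by n - 1) is an assumption. Quartiles
        are omitted because their definition varies between tools.</p>

```java
// Hypothetical illustration of column summary statistics; not part of Smile.
public class DescribeSketch {
    public static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Sample standard deviation (divides by n - 1); requires at least two values. */
    public static double std(double[] x) {
        double mu = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / (x.length - 1));
    }

    public static double min(double[] x) {
        double m = x[0];
        for (double v : x) m = Math.min(m, v);
        return m;
    }

    public static double max(double[] x) {
        double m = x[0];
        for (double v : x) m = Math.max(m, v);
        return m;
    }

    public static void main(String[] args) {
        // First five sepal lengths from the iris table shown earlier
        double[] x = {5.1, 4.9, 4.7, 4.6, 5.0};
        System.out.printf("count=%d mean=%.4f std=%.4f min=%.1f max=%.1f%n",
                x.length, mean(x), std(x), min(x), max(x));
    }
}
```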

    <h3 id="DataFrame_Selection">Selection</h3>
    <p>We can get a row with the array syntax or slice a subset of rows.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2" data-toggle="tab">Java</a></li>
        <li><a href="#scala_2" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_2" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_2">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> iris(0)
    res5: Tuple = {
      sepallength: 5.1,
      sepalwidth: 3.5,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }

    smile> iris(Index.range(10, 20))
    res6: DataFrame =
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
    |  1|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
    |  2|        4.8|         3|        1.4|       0.1|Iris-setosa|
    |  3|        4.3|         3|        1.1|       0.1|Iris-setosa|
    |  4|        5.8|         4|        1.2|       0.2|Iris-setosa|
    |  5|        5.7|       4.4|        1.5|       0.4|Iris-setosa|
    |  6|        5.4|       3.9|        1.3|       0.4|Iris-setosa|
    |  7|        5.1|       3.5|        1.4|       0.3|Iris-setosa|
    |  8|        5.7|       3.8|        1.7|       0.3|Iris-setosa|
    |  9|        5.1|       3.8|        1.5|       0.3|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_2">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> iris.get(0)
    $7 ==> {
      sepallength: 5.1,
      sepalwidth: 3.5,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }

    smile> iris.get(Index.range(10, 20))
    $8 ==>
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
    |  1|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
    |  2|        4.8|         3|        1.4|       0.1|Iris-setosa|
    |  3|        4.3|         3|        1.1|       0.1|Iris-setosa|
    |  4|        5.8|         4|        1.2|       0.2|Iris-setosa|
    |  5|        5.7|       4.4|        1.5|       0.4|Iris-setosa|
    |  6|        5.4|       3.9|        1.3|       0.4|Iris-setosa|
    |  7|        5.1|       3.5|        1.4|       0.3|Iris-setosa|
    |  8|        5.7|       3.8|        1.7|       0.3|Iris-setosa|
    |  9|        5.1|       3.8|        1.5|       0.3|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_2">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> iris[0]
    res6: smile.data.Tuple! = {
      sepallength: 5.1,
      sepalwidth: 3.5,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }

    >>> iris.get(Index.range(10, 20))
    res7: smile.data.DataFrame! =
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
    |  1|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
    |  2|        4.8|         3|        1.4|       0.1|Iris-setosa|
    |  3|        4.3|         3|        1.1|       0.1|Iris-setosa|
    |  4|        5.8|         4|        1.2|       0.2|Iris-setosa|
    |  5|        5.7|       4.4|        1.5|       0.4|Iris-setosa|
    |  6|        5.4|       3.9|        1.3|       0.4|Iris-setosa|
    |  7|        5.1|       3.5|        1.4|       0.3|Iris-setosa|
    |  8|        5.7|       3.8|        1.7|       0.3|Iris-setosa|
    |  9|        5.1|       3.8|        1.5|       0.3|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
    </code></pre>
    </div>
    </div>
    </div>

    <p>Similarly, we can refer to a column by its name or select a few columns to create a new data frame.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_3" data-toggle="tab">Java</a></li>
        <li><a href="#scala_3" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_3" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_3">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> iris("sepallength")
    res6: vector.ValueVector = sepallength[5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, ..., 140 more]

    smile> iris("sepallength", "sepalwidth")
    res8: DataFrame =
    +---+-----------+----------+
    |   |sepallength|sepalwidth|
    +---+-----------+----------+
    |  0|        5.1|       3.5|
    |  1|        4.9|         3|
    |  2|        4.7|       3.2|
    |  3|        4.6|       3.1|
    |  4|          5|       3.6|
    |  5|        5.4|       3.9|
    |  6|        4.6|       3.4|
    |  7|          5|       3.4|
    |  8|        4.4|       2.9|
    |  9|        4.9|       3.1|
    +---+-----------+----------+
    140 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_3">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> iris.column("sepallength")
    $8 ==> sepallength[5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, ..., 140 more]

    smile> iris.select("sepallength", "sepalwidth")
    $9 ==>
    +---+-----------+----------+
    |   |sepallength|sepalwidth|
    +---+-----------+----------+
    |  0|        5.1|       3.5|
    |  1|        4.9|         3|
    |  2|        4.7|       3.2|
    |  3|        4.6|       3.1|
    |  4|          5|       3.6|
    |  5|        5.4|       3.9|
    |  6|        4.6|       3.4|
    |  7|          5|       3.4|
    |  8|        4.4|       2.9|
    |  9|        4.9|       3.1|
    +---+-----------+----------+
    140 more rows...
          </code></pre>
            </div>
        </div>
        <div class="tab-pane" id="kotlin_3">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> iris.column("sepallength")
    res7: smile.data.vector.ValueVector! = sepallength[5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, ..., 140 more]

    >>> iris.select("sepallength", "sepalwidth")
    res8: smile.data.DataFrame! =
    +---+-----------+----------+
    |   |sepallength|sepalwidth|
    +---+-----------+----------+
    |  0|        5.1|       3.5|
    |  1|        4.9|         3|
    |  2|        4.7|       3.2|
    |  3|        4.6|       3.1|
    |  4|          5|       3.6|
    |  5|        5.4|       3.9|
    |  6|        4.6|       3.4|
    |  7|          5|       3.4|
    |  8|        4.4|       2.9|
    |  9|        4.9|       3.1|
    +---+-----------+----------+
    140 more rows...
    </code></pre>
    </div>
    </div>
    </div>

    <p>We can also select rows with boolean indexing. The example below uses the
        <code>isin()</code> method for filtering:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.3" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.3">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> iris.get(iris.column("class").isin("Iris-setosa", "Iris-virginica"))
    $10 ==>
    +---+-----------+----------+-----------+----------+-----------+
    |   |sepallength|sepalwidth|petallength|petalwidth|      class|
    +---+-----------+----------+-----------+----------+-----------+
    |  0|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |  1|        4.9|         3|        1.4|       0.2|Iris-setosa|
    |  2|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |  3|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |  4|          5|       3.6|        1.4|       0.2|Iris-setosa|
    |  5|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |  6|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |  7|          5|       3.4|        1.5|       0.2|Iris-setosa|
    |  8|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |  9|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +---+-----------+----------+-----------+----------+-----------+
    90 more rows...
    </code></pre>
            </div>
        </div>
    </div>
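    <p>Conceptually, boolean indexing is a two-step process: build one boolean per row,
        then keep the rows whose entry is true. The following is a minimal plain-Java sketch
        of that idea on a string array. The class <code>BooleanIndexingSketch</code> and its
        helpers are hypothetical names used for illustration only; they are not part of Smile
        and do not describe how Smile implements <code>isin()</code>.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of boolean indexing; none of these names are Smile APIs.
final class BooleanIndexingSketch {
    /** Returns a mask that is true where the column value is one of the accepted values. */
    static boolean[] isin(String[] column, String... accepted) {
        Set<String> lookup = new HashSet<>(Arrays.asList(accepted));
        boolean[] mask = new boolean[column.length];
        for (int i = 0; i < column.length; i++) {
            mask[i] = lookup.contains(column[i]);
        }
        return mask;
    }

    /** Returns the indices of the rows whose mask entry is true, in ascending order. */
    static int[] select(boolean[] mask) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) kept.add(i);
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String[] species = {"Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"};
        boolean[] mask = isin(species, "Iris-setosa", "Iris-virginica");
        System.out.println(Arrays.toString(select(mask))); // prints [0, 2, 3]
    }
}
```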

    <h3 id="DataFrame_Setting">Setting</h3>
    <p>Setting values by position:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.4" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.4">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> iris.set(0, 0, 1.5)
    </code></pre>
            </div>
        </div>
    </div>

    <p>Adding columns:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.5" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.5">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var df = DataFrame.of(MathEx.randn(150, 3))
           iris.add(df.column("V1"), df.column("V3"))
    </code></pre>
            </div>
        </div>
    </div>

    <p>Setting a column:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.6" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.6">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> iris.set("V1", df.column("V3"))
    </code></pre>
            </div>
        </div>
    </div>

    <h3 id="DataFrame_Merge">Merge</h3>
    <p>The <code>merge()</code> method can combine data frames horizontally:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.7" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.7">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var df3 = iris.merge(df)
    </code></pre>
            </div>
        </div>
    </div>

    <p>To concatenate data frames vertically, use <code>concat()</code>:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.8" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.8">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var iris2 = iris.concat(iris)
    </code></pre>
            </div>
        </div>
    </div>
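    <p>To make the difference between horizontal and vertical combination concrete, here is
        a small plain-Java sketch on <code>double[][]</code> tables. The helper names
        <code>merge</code> and <code>concat</code> only mirror the method names above; the
        class <code>CombineSketch</code> is hypothetical, and the requirement that both inputs
        to the horizontal case have the same number of rows is an assumption of this sketch,
        not a statement about Smile's behavior.</p>

```java
import java.util.Arrays;

// Plain-Java illustration on double[][] tables; these helpers are hypothetical, not Smile APIs.
final class CombineSketch {
    /** Horizontal combination: row i of the result is row i of a followed by row i of b. */
    static double[][] merge(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("row counts differ: " + a.length + " vs " + b.length);
        }
        double[][] out = new double[a.length][];
        for (int i = 0; i < a.length; i++) {
            // Copy row i of a into a wider array, then append row i of b.
            out[i] = Arrays.copyOf(a[i], a[i].length + b[i].length);
            System.arraycopy(b[i], 0, out[i], a[i].length, b[i].length);
        }
        return out;
    }

    /** Vertical combination: all rows of a followed by all rows of b (column counts are not checked). */
    static double[][] concat(double[][] a, double[][] b) {
        double[][] out = new double[a.length + b.length][];
        for (int i = 0; i < a.length; i++) out[i] = a[i].clone();
        for (int i = 0; i < b.length; i++) out[a.length + i] = b[i].clone();
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5}, {6}};
        System.out.println(Arrays.deepToString(merge(a, b)));  // 2 rows, 3 columns
        System.out.println(Arrays.deepToString(concat(a, a))); // 4 rows, 2 columns
    }
}
```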

    <p>To join two data frames on their index:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.9" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.9">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    var dates = Dates.range(LocalDate.of(2025,2,1), 6);
    var df1 = DataFrame.of(MathEx.randn(6, 4)).setIndex(dates);
    var df2 = DataFrame.of(MathEx.randn(6, 4)).setIndex(dates);
    var df = df1.join(df2);
    </code></pre>
            </div>
        </div>
    </div>

    <h3 id="DataFrame_Missing">Missing Data</h3>
    <p>For object columns, <code>null</code> indicates missing data. For primitive
        columns, Smile maintains a bit mask to indicate missing values. By convention,
        however, users represent missing data in floating-point columns with
        <code>NaN</code>. Use <code>DataFrame.isNullAt(i, j)</code>
        to check whether a cell is null or NaN.</p>
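    <p>The bit-mask idea can be sketched in plain Java: a primitive <code>int</code> has no
        spare value such as <code>NaN</code>, so a separate bit per cell records whether the
        cell is missing. The class <code>NullableIntColumn</code> below is hypothetical, and
        the use of <code>java.util.BitSet</code> is an assumption made for illustration; the
        text above does not say how Smile stores its mask.</p>

```java
import java.util.BitSet;

// Hypothetical nullable int column; illustrates a null bit mask, not Smile's internals.
final class NullableIntColumn {
    private final int[] values;
    private final BitSet nullMask; // bit i is set when cell i is missing

    NullableIntColumn(int size) {
        this.values = new int[size];
        this.nullMask = new BitSet(size);
    }

    /** Stores a value and marks the cell as present. */
    void set(int i, int value) {
        values[i] = value;
        nullMask.clear(i);
    }

    /** Marks the cell as missing; the stored int is reset to 0 but must not be read. */
    void setNull(int i) {
        values[i] = 0;
        nullMask.set(i);
    }

    boolean isNullAt(int i) {
        return nullMask.get(i);
    }

    /** Returns the value, or throws if the cell is missing. */
    int get(int i) {
        if (nullMask.get(i)) {
            throw new IllegalStateException("cell " + i + " is missing");
        }
        return values[i];
    }

    /** Number of missing cells. */
    int nullCount() {
        return nullMask.cardinality();
    }

    public static void main(String[] args) {
        NullableIntColumn age = new NullableIntColumn(3);
        age.set(0, 42);
        age.setNull(1);
        System.out.println(age.isNullAt(1)); // prints true
        System.out.println(age.nullCount()); // prints 1
    }
}
```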

    <p><code>DataFrame.dropna()</code> drops any row that has a null or missing value:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.11" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.11">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> var df = DataFrame.of(MathEx.randn(6, 4))
           df.set(0, 0, Double.NaN)
           df.set(1, 3, Double.NaN)
           df.dropna()
    +---+---------+---------+---------+---------+
    |   |       V1|       V2|       V3|       V4|
    +---+---------+---------+---------+---------+
    |  0|-0.504866|-0.179409| 0.201377| 0.281267|
    |  1| 0.894446| 0.791521| 0.053346| 0.213519|
    |  2| 0.200011|-0.203736|-0.349196|-1.193759|
    |  3|  1.52529|-1.407597|  1.16758| -1.78291|
    +---+---------+---------+---------+---------+
    </code></pre>
            </div>
        </div>
    </div>
    <p><code>DataFrame.fillna()</code> fills missing data in numeric columns:</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_2.10" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_2.10">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> df.fillna(100)
    +---+---------+---------+---------+---------+
    |   |       V1|       V2|       V3|       V4|
    +---+---------+---------+---------+---------+
    |  0|      100| 0.970363| 2.726932|-0.146014|
    |  1| 1.872161| 0.495932| 0.553859|      100|
    |  2|-0.504866|-0.179409| 0.201377| 0.281267|
    |  3| 0.894446| 0.791521| 0.053346| 0.213519|
    |  4| 0.200011|-0.203736|-0.349196|-1.193759|
    |  5|  1.52529|-1.407597|  1.16758| -1.78291|
    +---+---------+---------+---------+---------+
    </code></pre>
            </div>
        </div>
    </div>
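    <p>The semantics of dropping and filling can be sketched on a plain
        <code>double[][]</code> that uses <code>NaN</code> as the missing-value marker.
        The class <code>MissingDataSketch</code> and its methods are hypothetical and only
        borrow the names used above. Both helpers return new arrays and leave the input
        untouched; that is a design choice of this sketch, not a claim about whether the
        Smile methods copy or modify their data frame.</p>

```java
import java.util.Arrays;

// Plain-Java illustration of drop/fill semantics with NaN as the missing marker.
final class MissingDataSketch {
    /** True if any cell in the row is NaN. */
    static boolean hasMissing(double[] row) {
        for (double v : row) {
            if (Double.isNaN(v)) return true;
        }
        return false;
    }

    /** Keeps only the rows that contain no NaN; the surviving rows are shared, not copied. */
    static double[][] dropna(double[][] data) {
        return Arrays.stream(data)
                .filter(row -> !hasMissing(row))
                .toArray(double[][]::new);
    }

    /** Returns a deep copy of the data in which every NaN is replaced by the given value. */
    static double[][] fillna(double[][] data, double value) {
        double[][] copy = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            copy[i] = data[i].clone();
            for (int j = 0; j < copy[i].length; j++) {
                if (Double.isNaN(copy[i][j])) copy[i][j] = value;
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        double[][] data = {{Double.NaN, 1.0}, {2.0, 3.0}, {4.0, Double.NaN}};
        System.out.println(Arrays.deepToString(dropna(data)));      // prints [[2.0, 3.0]]
        System.out.println(Arrays.deepToString(fillna(data, 100))); // prints [[100.0, 1.0], [2.0, 3.0], [4.0, 100.0]]
    }
}
```

    <p>Note that <code>NaN == NaN</code> is false in Java, which is why the sketch tests
        cells with <code>Double.isNaN</code> rather than with <code>==</code>.</p>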

    <h3 id="DataFrame_Operations">Operations</h3>
    <p>Advanced operations such as <code>exists</code>, <code>forall</code>,
        <code>find</code>, and <code>filter</code> are also supported. In the Java API,
        these operations are performed on the <code>Stream</code> returned by
        <code>stream()</code>. The corresponding methods
        are <code>anyMatch</code>, <code>allMatch</code>, <code>findAny</code>,
        and <code>filter</code>.
        The <code>predicate</code> of each of these functions expects a <code>Tuple</code>.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_4" data-toggle="tab">Java</a></li>
        <li><a href="#scala_4" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_4" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_4">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile&gt; iris.exists(_.getDouble(0) > 4.5)
    res16: Boolean = true

    smile&gt; iris.forall(_.getDouble(0) < 10)
    res17: Boolean = true

    smile&gt; iris.find(_("class") == 1)
    res18: java.util.Optional[Tuple] = Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile&gt; iris.find(_.getString("class").equals("Iris-versicolor"))
    res19: java.util.Optional[Tuple] = Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile&gt; iris.filter { row => row.getDouble(1) > 3 && row("class") != 0 }
    res20: DataFrame =
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    |        6.3|       3.3|        4.7|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.4|       1.4|Iris-versicolor|
    |        5.9|       3.2|        4.8|       1.8|Iris-versicolor|
    |          6|       3.4|        4.5|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.7|       1.5|Iris-versicolor|
    |        6.3|       3.3|          6|       2.5| Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5| Iris-virginica|
    +-----------+----------+-----------+----------+---------------+
    15 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_4">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> iris.stream().anyMatch(row -> row.getDouble(0) > 4.5)
    $14 ==> true

    smile> iris.stream().allMatch(row -> row.getDouble(0) < 10)
    $15 ==> true

    smile> iris.stream().filter(row -> row.getByte("class") == 1).findAny()
    $17 ==> Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile> iris.stream().filter(row -> row.getString("class").equals("Iris-versicolor")).findAny()
    $18 ==> Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile> var stream = iris.stream().filter(row -> row.getDouble(1) > 3 && row.getByte("class") != 0)
           DataFrame.of(iris.schema(), stream)
    $20 ==>
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    |        6.3|       3.3|        4.7|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.4|       1.4|Iris-versicolor|
    |          6|       3.4|        4.5|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.7|       1.5|Iris-versicolor|
    |        6.3|       3.3|          6|       2.5| Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5| Iris-virginica|
    +-----------+----------+-----------+----------+---------------+
    15 more rows...
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_4">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> iris.stream().anyMatch({row -> row.getDouble(0) > 4.5})
    res10: kotlin.Boolean = true
    >>> iris.stream().allMatch({row -> row.getDouble(0) < 10})
    res11: kotlin.Boolean = true
    >>> iris.stream().filter({row -> row.getByte("class") == 1.toByte()}).findAny()
    res14: java.util.Optional&lt;smile.data.Tuple!&gt;! = Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]
    >>> iris.stream().filter({row -> row.getString("class").equals("Iris-versicolor")}).findAny()
    res15: java.util.Optional&lt;smile.data.Tuple!&gt;! = Optional[{
      sepallength: 5.4,
      sepalwidth: 3,
      petallength: 4.5,
      petalwidth: 1.5,
      class: Iris-versicolor
    }]
    >>> DataFrame.of(iris.stream().filter({row -> row.getDouble(1) > 3 && row.getByte("class") != 0.toByte()}))
    res22: smile.data.DataFrame! =
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    |        6.3|       3.3|        4.7|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.4|       1.4|Iris-versicolor|
    |        5.9|       3.2|        4.8|       1.8|Iris-versicolor|
    |          6|       3.4|        4.5|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.7|       1.5|Iris-versicolor|
    |        6.3|       3.3|          6|       2.5| Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5| Iris-virginica|
    +-----------+----------+-----------+----------+---------------+
    15 more rows...
    </code></pre>
    </div>
    </div>
    </div>
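    <p>The same four stream operations can be tried without Smile on an ordinary list.
        In the sketch below, the nested class <code>Row</code> is a hypothetical stand-in
        for a data frame row (it is not Smile's <code>Tuple</code>), and the four sample rows
        are taken from the iris tables shown on this page.</p>

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Plain-Java illustration of anyMatch, allMatch, findAny and filter on row objects.
final class RowPredicateSketch {
    /** Hypothetical stand-in for a data frame row; not Smile's Tuple. */
    static final class Row {
        final double sepalLength;
        final double sepalWidth;
        final String species;

        Row(double sepalLength, double sepalWidth, String species) {
            this.sepalLength = sepalLength;
            this.sepalWidth = sepalWidth;
            this.species = species;
        }
    }

    static final List<Row> ROWS = List.of(
            new Row(5.1, 3.5, "Iris-setosa"),
            new Row(7.0, 3.2, "Iris-versicolor"),
            new Row(6.3, 3.3, "Iris-virginica"),
            new Row(5.8, 2.7, "Iris-virginica"));

    /** Returns some row of the given species, if there is one. */
    static Optional<Row> findSpecies(List<Row> rows, String species) {
        return rows.stream().filter(r -> r.species.equals(species)).findAny();
    }

    /** Keeps rows with sepal width above 3 that are not Iris-setosa. */
    static List<Row> wideNonSetosa(List<Row> rows) {
        return rows.stream()
                .filter(r -> r.sepalWidth > 3 && !r.species.equals("Iris-setosa"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(ROWS.stream().anyMatch(r -> r.sepalLength > 4.5)); // prints true
        System.out.println(ROWS.stream().allMatch(r -> r.sepalLength < 10));  // prints true
        System.out.println(findSpecies(ROWS, "Iris-versicolor").isPresent()); // prints true
        System.out.println(wideNonSetosa(ROWS).size());                       // prints 2
    }
}
```

    <p><code>findAny</code> may return any matching element, so code that needs the first
        match in encounter order should use <code>findFirst</code> instead.</p>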

    <p>For data wrangling, the most important functions of <code>DataFrame</code>
        are <code>map</code> and <code>groupBy</code>.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_6" data-toggle="tab">Java</a></li>
        <li><a href="#scala_6" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_6" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_6">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile&gt; iris.map { row =>
                    val x = new Array[Double](6)
                    for (i <- 0 until 4) x(i) = row.getDouble(i)
                    x(4) = x(0) * x(1)
                    x(5) = x(2) * x(3)
                    x
                  }
    res22: Iterable[Array[Double]] = ArrayBuffer(
      Array(5.1, 3.5, 1.4, 0.2, 17.849999999999998, 0.27999999999999997),
      Array(4.9, 3.0, 1.4, 0.2, 14.700000000000001, 0.27999999999999997),
      Array(4.7, 3.2, 1.3, 0.2, 15.040000000000001, 0.26),
      Array(4.6, 3.1, 1.5, 0.2, 14.26, 0.30000000000000004),
      Array(5.0, 3.6, 1.4, 0.2, 18.0, 0.27999999999999997),
      Array(5.4, 3.9, 1.7, 0.4, 21.060000000000002, 0.68),
      Array(4.6, 3.4, 1.4, 0.3, 15.639999999999999, 0.42),
      Array(5.0, 3.4, 1.5, 0.2, 17.0, 0.30000000000000004),
      Array(4.4, 2.9, 1.4, 0.2, 12.76, 0.27999999999999997),
      Array(4.9, 3.1, 1.5, 0.1, 15.190000000000001, 0.15000000000000002),
      Array(5.4, 3.7, 1.5, 0.2, 19.980000000000004, 0.30000000000000004),
      Array(4.8, 3.4, 1.6, 0.2, 16.32, 0.32000000000000006),
      Array(4.8, 3.0, 1.4, 0.1, 14.399999999999999, 0.13999999999999999),
      Array(4.3, 3.0, 1.1, 0.1, 12.899999999999999, 0.11000000000000001),
      Array(5.8, 4.0, 1.2, 0.2, 23.2, 0.24),
      Array(5.7, 4.4, 1.5, 0.4, 25.080000000000002, 0.6000000000000001),
      Array(5.4, 3.9, 1.3, 0.4, 21.060000000000002, 0.52),
      Array(5.1, 3.5, 1.4, 0.3, 17.849999999999998, 0.42),
      Array(5.7, 3.8, 1.7, 0.3, 21.66, 0.51),
      Array(5.1, 3.8, 1.5, 0.3, 19.38, 0.44999999999999996),
      Array(5.4, 3.4, 1.7, 0.2, 18.36, 0.34),
      Array(5.1, 3.7, 1.5, 0.4, 18.87, 0.6000000000000001),
      Array(4.6, 3.6, 1.0, 0.2, 16.56, 0.2),
      Array(5.1, 3.3, 1.7, 0.5, 16.83, 0.85),
    ...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_6">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
    smile> var x6 = iris.stream().map(row -> {
               var x = new double[6];
               for (int i = 0; i < 4; i++) x[i] = row.getDouble(i);
               x[4] = x[0] * x[1];
               x[5] = x[2] * x[3];
               return x;
           })
    x6 ==> java.util.stream.ReferencePipeline$3@32eff876

    smile> x6.forEach(xi -> System.out.println(Arrays.toString(xi)))
    [6.199999809265137, 2.9000000953674316, 4.300000190734863, 1.2999999523162842, 17.980000038146954, 5.590000042915335]
    [7.300000190734863, 2.9000000953674316, 6.300000190734863, 1.7999999523162842, 21.170001249313373, 11.340000042915335]
    [7.699999809265137, 3.0, 6.099999904632568, 2.299999952316284, 23.09999942779541, 14.029999489784245]
    [6.699999809265137, 2.5, 5.800000190734863, 1.7999999523162842, 16.749999523162842, 10.440000066757193]
    [7.199999809265137, 3.5999999046325684, 6.099999904632568, 2.5, 25.919998626709003, 15.249999761581421]
    [6.5, 3.200000047683716, 5.099999904632568, 2.0, 20.800000309944153, 10.199999809265137]
    [6.400000095367432, 2.700000047683716, 5.300000190734863, 1.899999976158142, 17.28000056266785, 10.070000236034389]
    [5.699999809265137, 2.5999999046325684, 3.5, 1.0, 14.819998960495013, 3.5]
    [4.599999904632568, 3.5999999046325684, 1.0, 0.20000000298023224, 16.55999921798707, 0.20000000298023224]
    [5.400000095367432, 3.0, 4.5, 1.5, 16.200000286102295, 6.75]
    [6.699999809265137, 3.0999999046325684, 4.400000095367432, 1.399999976158142, 20.76999876976015, 6.160000028610227]
    [5.099999904632568, 3.799999952316284, 1.600000023841858, 0.20000000298023224, 19.379999394416814, 0.32000000953674324]
    [5.599999904632568, 3.0, 4.5, 1.5, 16.799999713897705, 6.75]
    [6.0, 3.4000000953674316, 4.5, 1.600000023841858, 20.40000057220459, 7.200000107288361]
    [5.099999904632568, 3.299999952316284, 1.7000000476837158, 0.5, 16.82999944210053, 0.8500000238418579]
    [5.5, 2.4000000953674316, 3.799999952316284, 1.100000023841858, 13.200000524520874, 4.1800000381469715]
    [7.099999904632568, 3.0, 5.900000095367432, 2.0999999046325684, 21.299999713897705, 12.38999963760375]
    [6.300000190734863, 3.4000000953674316, 5.599999904632568, 2.4000000953674316, 21.420001249313373, 13.440000305175772]
    [5.099999904632568, 2.5, 3.0, 1.100000023841858, 12.749999761581421, 3.3000000715255737]
    [6.400000095367432, 3.0999999046325684, 5.5, 1.7999999523162842, 19.839999685287466, 9.899999737739563]
    [6.300000190734863, 2.9000000953674316, 5.599999904632568, 1.7999999523162842, 18.27000115394594, 10.079999561309819]
    [5.5, 2.4000000953674316, 3.700000047683716, 1.0, 13.200000524520874, 3.700000047683716]
    [6.5, 3.0, 5.800000190734863, 2.200000047683716, 19.5, 12.76000069618226]
    [7.599999904632568, 3.0, 6.599999904632568, 2.0999999046325684, 22.799999713897705, 13.859999170303354]
    [4.900000095367432, 2.5, 4.5, 1.7000000476837158, 12.250000238418579, 7.650000214576721]
    [5.0, 2.299999952316284, 3.299999952316284, 1.0, 11.499999761581421, 3.299999952316284]
    [5.599999904632568, 2.700000047683716, 4.199999809265137, 1.2999999523162842, 15.120000009536739, 5.45999955177308]
    ...
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_6">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
    >>>  val x6 = iris.stream().map({row ->
    ...            val x = DoubleArray(6)
    ...            for (i in 0..3) x[i] = row.getDouble(i)
    ...            x[4] = x[0] * x[1]
    ...            x[5] = x[2] * x[3]
    ...            x
    ...        })
    >>> x6.forEach({xi: DoubleArray -> println(java.util.Arrays.toString(xi))})
    [5.699999809265137, 2.5999999046325684, 3.5, 1.0, 14.819998960495013, 3.5]
    [6.699999809265137, 3.0999999046325684, 4.400000095367432, 1.399999976158142, 20.76999876976015, 6.160000028610227]
    [5.400000095367432, 3.0, 4.5, 1.5, 16.200000286102295, 6.75]
    [5.5, 2.4000000953674316, 3.799999952316284, 1.100000023841858, 13.200000524520874, 4.1800000381469715]
    [5.599999904632568, 3.0, 4.5, 1.5, 16.799999713897705, 6.75]
    [4.900000095367432, 3.0999999046325684, 1.5, 0.10000000149011612, 15.189999828338614, 0.15000000223517418]
    [4.599999904632568, 3.5999999046325684, 1.0, 0.20000000298023224, 16.55999921798707, 0.20000000298023224]
    [7.699999809265137, 3.0, 6.099999904632568, 2.299999952316284, 23.09999942779541, 14.029999489784245]
    [5.400000095367432, 3.700000047683716, 1.5, 0.20000000298023224, 19.980000610351567, 0.30000000447034836]
    [5.800000190734863, 2.700000047683716, 4.099999904632568, 1.0, 15.660000791549692, 4.099999904632568]
    [6.300000190734863, 3.4000000953674316, 5.599999904632568, 2.4000000953674316, 21.420001249313373, 13.440000305175772]
    [6.0, 3.4000000953674316, 4.5, 1.600000023841858, 20.40000057220459, 7.200000107288361]
    [6.199999809265137, 2.200000047683716, 4.5, 1.5, 13.63999987602233, 6.75]
    [6.400000095367432, 3.0999999046325684, 5.5, 1.7999999523162842, 19.839999685287466, 9.899999737739563]
    [6.699999809265137, 3.0999999046325684, 4.699999809265137, 1.5, 20.76999876976015, 7.049999713897705]
    [5.5, 2.4000000953674316, 3.700000047683716, 1.0, 13.200000524520874, 3.700000047683716]
    [5.099999904632568, 3.799999952316284, 1.600000023841858, 0.20000000298023224, 19.379999394416814, 0.32000000953674324]
    [6.199999809265137, 2.9000000953674316, 4.300000190734863, 1.2999999523162842, 17.980000038146954, 5.590000042915335]
    [6.300000190734863, 2.299999952316284, 4.400000095367432, 1.2999999523162842, 14.490000138282767, 5.719999914169307]
    [5.800000190734863, 2.700000047683716, 3.9000000953674316, 1.2000000476837158, 15.660000791549692, 4.680000300407414]
    [6.0, 3.0, 4.800000190734863, 1.7999999523162842, 18.0, 8.640000114440909]
    [5.599999904632568, 2.5, 3.9000000953674316, 1.100000023841858, 13.999999761581421, 4.290000197887423]
    [4.800000190734863, 3.4000000953674316, 1.600000023841858, 0.20000000298023224, 16.320001106262225, 0.32000000953674324]
    [6.900000095367432, 3.0999999046325684, 5.400000095367432, 2.0999999046325684, 21.38999963760375, 11.339999685287466]
    [5.900000095367432, 3.200000047683716, 4.800000190734863, 1.7999999523162842, 18.88000058650971, 8.640000114440909]
    [4.800000190734863, 3.0, 1.399999976158142, 0.10000000149011612, 14.40000057220459, 0.13999999970197674]
    [5.099999904632568, 3.299999952316284, 1.7000000476837158, 0.5, 16.82999944210053, 0.8500000238418579]
    [6.099999904632568, 2.799999952316284, 4.0, 1.2999999523162842, 17.07999944210053, 5.199999809265137]
    [7.900000095367432, 3.799999952316284, 6.400000095367432, 2.0, 30.01999998569488, 12.800000190734863]
    [6.0, 2.700000047683716, 5.099999904632568, 1.600000023841858, 16.200000286102295, 8.159999969005582]
    [6.400000095367432, 2.799999952316284, 5.599999904632568, 2.200000047683716, 17.919999961853023, 12.320000057220454]
    [6.599999904632568, 3.0, 4.400000095367432, 1.399999976158142, 19.799999713897705, 6.160000028610227]
    ...
    </code></pre>
    </div>
    </div>
    </div>
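    <p>A Java <code>Stream</code> can be traversed only once, so a mapped stream that has
        been printed with <code>forEach</code> cannot be reused afterwards. One option is to
        collect the mapped rows into an array. The sketch below applies the same feature
        expansion as the example above to a plain <code>double[][]</code>; the class
        <code>DerivedFeatureSketch</code> and its method names are hypothetical.</p>

```java
import java.util.Arrays;

// Plain-Java illustration: map each 4-value row to a 6-value row and materialize the result.
final class DerivedFeatureSketch {
    /** Appends two product features: x[0] * x[1] and x[2] * x[3]. */
    static double[] expand(double[] row) {
        if (row.length != 4) {
            throw new IllegalArgumentException("expected 4 values, got " + row.length);
        }
        double[] x = Arrays.copyOf(row, 6); // positions 4 and 5 start as 0.0
        x[4] = x[0] * x[1];
        x[5] = x[2] * x[3];
        return x;
    }

    /** Maps every row and collects the results so they can be read more than once. */
    static double[][] expandAll(double[][] rows) {
        return Arrays.stream(rows)
                .map(DerivedFeatureSketch::expand)
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        double[][] rows = {{2, 3, 4, 5}, {1, 1, 2, 2}};
        double[][] expanded = expandAll(rows);
        System.out.println(Arrays.toString(expanded[0])); // prints [2.0, 3.0, 4.0, 5.0, 6.0, 20.0]
        System.out.println(expanded.length);              // prints 2
    }
}
```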

    <p>The <code>groupBy</code> operation groups elements according to a classification
        function and returns the results in a <code>Map</code>. The classification
        function maps elements to some key type <code>K</code>. The collector produces
        a map whose keys are the values that result from applying the classification
        function to the input elements, and whose values are lists
        containing the input elements that map to the associated key under the
        classification function.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_7" data-toggle="tab">Java</a></li>
        <li><a href="#scala_7" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_7" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_7">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> iris.groupBy(row => row.getString("class"))
    res23: Map[String, DataFrame] = Map(
      "Iris-virginica" ->
    +-----------+----------+-----------+----------+--------------+
    |sepallength|sepalwidth|petallength|petalwidth|         class|
    +-----------+----------+-----------+----------+--------------+
    |        6.3|       3.3|          6|       2.5|Iris-virginica|
    |        5.8|       2.7|        5.1|       1.9|Iris-virginica|
    |        7.1|         3|        5.9|       2.1|Iris-virginica|
    |        6.3|       2.9|        5.6|       1.8|Iris-virginica|
    |        6.5|         3|        5.8|       2.2|Iris-virginica|
    |        7.6|         3|        6.6|       2.1|Iris-virginica|
    |        4.9|       2.5|        4.5|       1.7|Iris-virginica|
    |        7.3|       2.9|        6.3|       1.8|Iris-virginica|
    |        6.7|       2.5|        5.8|       1.8|Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5|Iris-virginica|
    +-----------+----------+-----------+----------+--------------+
    40 more rows...
    ,
      "Iris-versicolor" ->
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    ...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_7">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> iris.stream().collect(java.util.stream.Collectors.groupingBy(row -> row.getString("class")))
    $24 ==> {Iris-versicolor=[{
      sepallength: 7,
      sepalwidth: 3.2,
      petallength: 4.7,
      petalwidth: 1.4,
      class: Iris-versicolor
    }, {
      sepallength: 6.4,
      sepalwidth: 3.2,
      petallength: 4.5,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 6.9,
      sepalwidth: 3.1,
      petallength: 4.9,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.5,
      sepalwidth: 2.3,
      petallength: 4,
      petalwidth: 1.3,
      class: Iris-versicolor
    }, {
      sepallength: 6.5,
      sepalwidth: 2.8,
      petallength: 4.6,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.7,
      sepalwidth: 2.8,
      petallength: 4.5,
      petalwidth: 1.3,
      class: Iris-versicolor
    },  ...  class: Iris-setosa
    }, {
      sepallength: 4.6,
      sepalwidth: 3.2,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }, {
      sepallength: 5.3,
      sepalwidth: 3.7,
      petallength: 1.5,
      petalwidth: 0.2,
      class: Iris-setosa
    }, {
      sepallength: 5,
      sepalwidth: 3.3,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }]}
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_7">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> iris.stream().collect(java.util.stream.Collectors.groupingBy({row: Tuple -> row.getString("class")}))
    res98: kotlin.collections.(Mutable)Map&lt;kotlin.String!, kotlin.collections.(Mutable)List&lt;smile.data.Tuple!&gt;!&gt;! = {Iris-versicolor=[{
      sepallength: 7,
      sepalwidth: 3.2,
      petallength: 4.7,
      petalwidth: 1.4,
      class: Iris-versicolor
    }, {
      sepallength: 6.4,
      sepalwidth: 3.2,
      petallength: 4.5,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 6.9,
      sepalwidth: 3.1,
      petallength: 4.9,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.5,
      sepalwidth: 2.3,
      petallength: 4,
      petalwidth: 1.3,
      class: Iris-versicolor
    }, {
      sepallength: 6.5,
      sepalwidth: 2.8,
      petallength: 4.6,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.7,
      sepalwidth: 2.8,
      petallength: 4.5,
      petalwidth: 1.3,
      class: Iris-versicolor
    }, {
      sepallength: 6.3,
      sepalwidth: 3.3,
      petallength: 4.7,
      petalwidth: 1.6,
      class: Iris-versicolor
    }, {
      sepallength: 4.9,
      sepalwidth: 2.4,
      petallength: 3.3,
      petalwidth: 1,
      class: Iris-versicolor
    }, {
    ...
    </code></pre>
    </div>
    </div>
    </div>
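    <p>Grouping into lists is often only the first step; <code>Collectors.groupingBy</code>
        also accepts a downstream collector that reduces each group directly. The sketch
        below counts rows and averages one column per group using only the Java standard
        library. The nested class <code>Row</code> is a hypothetical stand-in for a data
        frame row, and <code>TreeMap</code> is chosen here only to get keys in sorted order.</p>

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of groupingBy with a downstream collector.
final class GroupingSketch {
    /** Hypothetical stand-in for a data frame row; not Smile's Tuple. */
    static final class Row {
        final String species;
        final double sepalLength;

        Row(String species, double sepalLength) {
            this.species = species;
            this.sepalLength = sepalLength;
        }
    }

    /** Number of rows per species, with keys in sorted order. */
    static Map<String, Long> countBySpecies(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(r -> r.species, TreeMap::new, Collectors.counting()));
    }

    /** Mean sepal length per species, with keys in sorted order. */
    static Map<String, Double> meanSepalLength(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(r -> r.species, TreeMap::new,
                        Collectors.averagingDouble(r -> r.sepalLength)));
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("Iris-setosa", 5.1),
                new Row("Iris-setosa", 4.9),
                new Row("Iris-versicolor", 7.0));
        System.out.println(countBySpecies(rows)); // prints {Iris-setosa=2, Iris-versicolor=1}
        System.out.println(meanSepalLength(rows));
    }
}
```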

    <h2 id="SQL">SQL</h2>

    <p>While Smile provides many imperative ways to manipulate DataFrames, as shown above, it is probably
        easier to do so with SQL.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_26" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_26">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    smile> SQL sql = new SQL();
           sql.parquet("user", "data/kylo/userdata1.parquet");
           sql.json("books", "data/kylo/books_array.json");
           sql.csv("gdp", "data/regression/gdp.csv");
           sql.csv("diabetes", "data/regression/diabetes.csv");

           var tables = sql.tables();
    tables ==>
    +----------+-------+
    |TABLE_NAME|REMARKS|
    +----------+-------+
    |     books|   null|
    |  diabetes|   null|
    |       gdp|   null|
    |      user|   null|
    +----------+-------+

    smile> var columns = sql.describe("user");
    columns ==>
    +-----------------+---------+-----------+
    |      COLUMN_NAME|TYPE_NAME|IS_NULLABLE|
    +-----------------+---------+-----------+
    |registration_dttm|TIMESTAMP|        YES|
    |               id|  INTEGER|        YES|
    |       first_name|  VARCHAR|        YES|
    |        last_name|  VARCHAR|        YES|
    |            email|  VARCHAR|        YES|
    |           gender|  VARCHAR|        YES|
    |       ip_address|  VARCHAR|        YES|
    |               cc|  VARCHAR|        YES|
    |          country|  VARCHAR|        YES|
    |        birthdate|  VARCHAR|        YES|
    +-----------------+---------+-----------+
    3 more rows...
    </code></pre>
            </div>
        </div>
    </div>

    <p>In the example above, we create a database and register four tables by loading Parquet, JSON, and CSV files.
        We also use the <code>describe</code> function to obtain the schema of the table <code>user</code>. With SQL,
        it is easy to filter data, and the result is a DataFrame.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_27" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_27">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var user = sql.query("SELECT * FROM user WHERE country = 'China'");
[main] INFO smile.data.SQL - SELECT * FROM user WHERE country = 'China'
user ==>
+-------------------+---+----------+----------+--------------------+------+---------------+-------------------+-------+---------+---------+--------------------+--------+
|  registration_dttm| id|first_name| last_name|               email|gender|     ip_address|                 cc|country|birthdate|   salary|               title|comments|
+-------------------+---+----------+----------+--------------------+------+---------------+-------------------+-------+---------+---------+--------------------+--------+
|2016-02-03T00:36:21|  4|    Denise|     Riley|    driley3@gmpg.org|Female|  140.35.109.83|   3576031598965625|  China| 4/8/1997| 90263.05|Senior Cost Accou...|        |
|2016-02-03T18:04:34| 12|     Alice|     Berry|aberryb@wikipedia...|Female| 246.225.12.189|   4917830851454417|  China|8/12/1968| 22944.53|    Quality Engineer|        |
|2016-02-03T10:30:36| 20|   Rebecca|      Bell| rbellj@bandcamp.com|Female|172.215.104.127|                   |  China|         |137251.19|                    |        |
|2016-02-03T08:41:26| 27|     Henry|     Henry| hhenryq@godaddy.com|  Male| 191.88.236.116|4905730021217853521|  China|9/22/1995|284300.15|Nuclear Power Eng...|        |
|2016-02-03T20:46:39| 37|   Dorothy|     Gomez|dgomez10@jiathis.com|Female| 65.111.200.146| 493684876859391834|  China|         | 57194.86|                    |        |
|2016-02-03T08:34:26| 43|    Amanda|      Gray|  agray16@cdbaby.com|Female| 252.20.193.145|   3561501596653859|  China|8/28/1967|213410.26|Senior Quality En...|        |
|2016-02-03T00:05:52| 53|     Ralph|     Price|  rprice1g@tmall.com|  Male|   152.6.235.33|   4844227560658222|  China|8/26/1986| 168208.4|             Teacher|        |
|2016-02-03T16:03:13| 55|      Anna|Montgomery|amontgomery1i@goo...|Female|  80.111.141.47|   3586860392406446|  China| 9/6/1957|  92837.5|Software Test Eng...|     1E2|
|2016-02-03T00:33:25| 57|    Willie|    Palmer|wpalmer1k@t-onlin...|  Male| 164.107.46.161|   4026614769857244|  China|8/23/1986|184978.64|Environmental Spe...|        |
|2016-02-03T05:55:57| 58|    Arthur|     Berry|    aberry1l@unc.edu|  Male|    52.42.24.55|   3542761473624274|  China|         |144164.88|                    |        |
+-------------------+---+----------+----------+--------------------+------+---------------+-------------------+-------+---------+---------+--------------------+--------+
179 more rows...
    </code></pre>
            </div>
        </div>
    </div>

    <p>Joins are also very useful for preparing data from multiple sources. The resulting DataFrame
        may be fed to downstream machine learning algorithms.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_28" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="java_28">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var gdp = sql.query("SELECT * FROM user LEFT JOIN gdp ON user.country = gdp.Country");
[main] INFO smile.data.SQL - SELECT * FROM user LEFT JOIN gdp ON user.country = gdp.Country
gdp ==>
+-------------------+---+----------+---------+--------------------+------+---------------+------------------+---------+----------+---------+--------------------+--------------------+---------+----------+-----+--------+
|  registration_dttm| id|first_name|last_name|               email|gender|     ip_address|                cc|  country| birthdate|   salary|               title|            comments|  Country|GDP Growth| Debt|Interest|
+-------------------+---+----------+---------+--------------------+------+---------------+------------------+---------+----------+---------+--------------------+--------------------+---------+----------+-----+--------+
|2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|    1.197.201.2|  6759521864920116|Indonesia|  3/8/1971| 49756.53|    Internal Auditor|               1E+02|Indonesia|       6.5| 26.2|     7.7|
|2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male| 218.111.175.34|                  |   Canada| 1/16/1968|150280.17|       Accountant IV|                    |   Canada|       2.5| 52.5|     9.5|
|2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female| 195.131.81.179|  3583136326049310|Indonesia| 2/25/1983| 69227.11|   Account Executive|                    |Indonesia|       6.5| 26.2|     7.7|
|2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male| 232.234.81.197|  3582641366974690| Portugal|12/18/1987| 14247.62|Senior Financial ...|                    | Portugal|      -1.6| 92.5|     9.7|
|2016-02-03T18:29:47| 10|     Emily|  Stewart|estewart9@opensou...|Female| 143.28.251.245|  3574254110301671|  Nigeria| 1/28/1997| 27234.28|     Health Coach IV|                    |  Nigeria|       7.4|    3|     6.6|
|2016-02-03T08:53:23| 15|   Dorothy|   Hudson|dhudsone@blogger.com|Female|       8.59.7.0|  3542586858224170|    Japan|12/20/1989|157099.71|  Nurse Practicioner|        alert('hi...|    Japan|      -0.6|174.8|    15.7|
|2016-02-03T00:44:01| 16|     Bruce|   Willis|bwillisf@bluehost...|  Male|239.182.219.189|  3573030625927601|   Brazil|          |239100.65|                    |                    |   Brazil|       2.7| 52.8|    24.1|
|2016-02-03T16:44:24| 18|   Stephen|  Wallace|swallaceh@netvibe...|  Male|  152.49.213.62|  5433943468526428|  Ukraine| 1/15/1978|248877.99|Account Represent...|                    |  Ukraine|       5.2| 27.4|     5.2|
|2016-02-03T18:50:55| 23|   Gregory|   Barnes|  gbarnesm@google.ru|  Male| 220.22.114.145|  3538432455620641|  Tunisia| 1/23/1971|182233.49|Senior Sales Asso...|         사회과학원 어학연구소|  Tunisia|        -2|   44|     5.8|
|2016-02-03T08:02:34| 26|   Anthony| Lawrence|alawrencep@miitbe...|  Male| 121.211.242.99|564182969714151470|    Japan|12/10/1979|170085.81| Electrical Engineer|                    |    Japan|      -0.6|174.8|    15.7|
+-------------------+---+----------+---------+--------------------+------+---------------+------------------+---------+----------+---------+--------------------+--------------------+---------+----------+-----+--------+
990 more rows...
    </code></pre>
            </div>
        </div>
    </div>
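
    <p>For readers less familiar with SQL, the semantics of the <code>LEFT JOIN</code> above can be
        sketched in plain Java. This is only an illustration and not how Smile executes the query; the class
        <code>LeftJoinSketch</code>, the user <code>Zed</code>, and the country <code>Atlantis</code> are
        hypothetical. Every row of the left table is kept, and a key with no match on the right
        yields <code>null</code>.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative LEFT JOIN on a country key. Hypothetical helper, not part of Smile. */
class LeftJoinSketch {
    /**
     * Joins users (name, country) against a country -> GDP growth lookup.
     * Every user row is kept; an unmatched country produces a null growth value.
     */
    static List<String> leftJoin(List<String[]> users, Map<String, Double> gdpGrowth) {
        List<String> out = new ArrayList<>();
        for (String[] user : users) {
            Double growth = gdpGrowth.get(user[1]); // null when the country is absent
            out.add(user[0] + "|" + user[1] + "|" + growth);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> gdpGrowth = new HashMap<>();
        gdpGrowth.put("Indonesia", 6.5); // value taken from the joined output above

        List<String[]> users = Arrays.asList(
            new String[] {"Amanda", "Indonesia"},
            new String[] {"Zed", "Atlantis"}); // hypothetical row with no match

        for (String row : leftJoin(users, gdpGrowth)) {
            System.out.println(row);
        }
    }
}
```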

    <h2 id="sparse">Sparse Dataset</h2>

    <p>Feature vectors can be very sparse. To save space, <a href="api/java/smile/data/SparseDataset.html">SparseDataset</a>
        stores data in a list of lists (LIL) sparse matrix format. SparseDataset stores one list
        per row, where each entry stores a column index and value. Typically, these entries
        are kept sorted by column index for faster lookup.</p>
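
    <p>To make the list-of-lists layout concrete, here is a minimal sketch in plain Java. It is not
        Smile's <code>SparseDataset</code> implementation; the class <code>LilSketch</code> and its
        methods are hypothetical names chosen for illustration. Each row holds entries sorted by
        column index, so a lookup is a binary search within a single row.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Toy list-of-lists (LIL) sparse matrix. Hypothetical, not Smile's SparseDataset. */
class LilSketch {
    /** One stored entry: a column index and its value. */
    static final class Entry {
        final int col;
        final double value;

        Entry(int col, double value) {
            this.col = col;
            this.value = value;
        }
    }

    /** One list of entries per row, each kept sorted by column index. */
    private final List<List<Entry>> rows = new ArrayList<>();

    /** Sets cell (i, j) to x, inserting at the position that keeps the row sorted. */
    void set(int i, int j, double x) {
        while (rows.size() <= i) {
            rows.add(new ArrayList<>());
        }
        List<Entry> row = rows.get(i);
        int k = find(row, j);
        if (k >= 0) {
            row.set(k, new Entry(j, x));      // overwrite the existing entry
        } else {
            row.add(-k - 1, new Entry(j, x)); // insert at the sorted position
        }
    }

    /** Returns the value at (i, j), or 0 if no entry is stored there. */
    double get(int i, int j) {
        if (i >= rows.size()) {
            return 0.0;
        }
        List<Entry> row = rows.get(i);
        int k = find(row, j);
        return k >= 0 ? row.get(k).value : 0.0;
    }

    /** Number of stored entries across all rows. */
    int size() {
        int n = 0;
        for (List<Entry> row : rows) {
            n += row.size();
        }
        return n;
    }

    /** Binary search by column; returns the index, or -(insertionPoint) - 1 if absent. */
    private static int find(List<Entry> row, int j) {
        int lo = 0;
        int hi = row.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = row.get(mid).col;
            if (c < j) {
                lo = mid + 1;
            } else if (c > j) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    public static void main(String[] args) {
        LilSketch m = new LilSketch();
        m.set(0, 5, 2.0);
        m.set(0, 1, 3.0); // inserted before column 5 to keep row 0 sorted
        m.set(2, 4, 7.5);
        System.out.println(m.get(0, 1) + " " + m.get(1, 0) + " " + m.size());
    }
}
```

    <p>Insertion into a row is cheap when rows are short, which is why a layout of this kind suits
        incremental construction better than arithmetic.</p>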

    <p>SparseDataset is often used to construct the data matrix. Once the matrix is constructed,
        it is typically converted to a format such as the <a href="api/java/smile/math/matrix/SparseMatrix.html">Harwell-Boeing</a>
        column-compressed sparse matrix format, which is more efficient for matrix operations.</p>
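
    <p>The conversion from row lists to a column-compressed layout can be sketched as follows. This is
        a generic counting-based conversion written for illustration, under the assumption that the
        row lists are given as parallel arrays of column indices and values. It is not taken from
        Smile's source, and <code>CscSketch</code> with its fields is a hypothetical name. The
        result stores a column pointer array, row indices, and values, so a matrix-vector product
        walks each column's entries contiguously.</p>

```java
import java.util.Arrays;

/** Toy column-compressed (CSC) matrix built from row lists. Hypothetical, not Smile's SparseMatrix. */
class CscSketch {
    final int nrow;
    final int ncol;
    final int[] colPtr;    // colPtr[j] .. colPtr[j + 1] - 1 index the entries of column j
    final int[] rowIdx;    // row index of each stored entry
    final double[] values; // value of each stored entry

    CscSketch(int nrow, int ncol, int[] colPtr, int[] rowIdx, double[] values) {
        this.nrow = nrow;
        this.ncol = ncol;
        this.colPtr = colPtr;
        this.rowIdx = rowIdx;
        this.values = values;
    }

    /** Builds CSC arrays from LIL-style rows: cols[i][k] and vals[i][k] describe row i. */
    static CscSketch fromRows(int ncol, int[][] cols, double[][] vals) {
        int nrow = cols.length;
        int[] colPtr = new int[ncol + 1];
        int nnz = 0;
        // Pass 1: count the entries in each column.
        for (int i = 0; i < nrow; i++) {
            for (int j : cols[i]) {
                colPtr[j + 1]++;
            }
            nnz += cols[i].length;
        }
        // Prefix sums turn the counts into column start offsets.
        for (int j = 0; j < ncol; j++) {
            colPtr[j + 1] += colPtr[j];
        }
        // Pass 2: scatter entries; visiting rows in order keeps row indices ascending per column.
        int[] next = Arrays.copyOf(colPtr, ncol);
        int[] rowIdx = new int[nnz];
        double[] values = new double[nnz];
        for (int i = 0; i < nrow; i++) {
            for (int k = 0; k < cols[i].length; k++) {
                int p = next[cols[i][k]]++;
                rowIdx[p] = i;
                values[p] = vals[i][k];
            }
        }
        return new CscSketch(nrow, ncol, colPtr, rowIdx, values);
    }

    /** Computes y = A x by scanning each column's contiguous entries. */
    double[] times(double[] x) {
        double[] y = new double[nrow];
        for (int j = 0; j < ncol; j++) {
            for (int p = colPtr[j]; p < colPtr[j + 1]; p++) {
                y[rowIdx[p]] += values[p] * x[j];
            }
        }
        return y;
    }

    public static void main(String[] args) {
        int[][] cols = {{0, 2}, {1}, {0, 2}};
        double[][] vals = {{1, 2}, {3}, {4, 5}};
        CscSketch a = fromRows(3, cols, vals);
        System.out.println(Arrays.toString(a.colPtr));
        System.out.println(Arrays.toString(a.times(new double[] {1, 1, 1})));
    }
}
```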

    <p>The class <a href="api/java/smile/data/BinarySparseDataset.html">BinarySparseDataset</a> is more efficient for
        binary sparse data. In BinarySparseDataset, each item is stored as an integer array holding
        the indices of its nonzero elements in ascending order.</p>
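
    <p>As a rough illustration of this representation, the sketch below stores each binary item as a
        sorted <code>int[]</code> of nonzero indices. The class <code>BinarySparseSketch</code> and
        its methods are hypothetical and do not mirror the <code>BinarySparseDataset</code> API.
        Because the indices are sorted, membership is a binary search and the dot product of two
        items is a linear merge.</p>

```java
import java.util.Arrays;

/** Binary sparse items as sorted index arrays. Hypothetical, not Smile's BinarySparseDataset. */
class BinarySparseSketch {
    /** Builds an item from the indices of its nonzero elements, sorted ascending. */
    static int[] of(int... indices) {
        int[] item = indices.clone();
        Arrays.sort(item);
        return item;
    }

    /** Returns 1 if element j of the item is nonzero, otherwise 0. */
    static int get(int[] item, int j) {
        return Arrays.binarySearch(item, j) >= 0 ? 1 : 0;
    }

    /** Dot product of two binary items: the number of indices they share. */
    static int dot(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                n++;
                i++;
                j++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[] a = of(7, 2, 5);
        int[] b = of(5, 7, 9);
        System.out.println(Arrays.toString(a) + " " + get(a, 5) + " " + dot(a, b));
    }
}
```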

    <h2 id="parser">Parsers</h2>

    <p>Smile provides parsers for a number of popular data formats, such as Parquet, Avro, Arrow, SAS7BDAT, Weka's ARFF files,
        LibSVM's file format, delimited text files, JSON, and binary sparse data. We will demonstrate
        these parsers with the sample data in the <code>data</code> directory. In the Scala API, the
        parsing functions are in the <code>smile.read</code> object.</p>

    <h3 id="read.parquet">Apache Parquet</h3>
    <p><a href="https://parquet.apache.org/">Apache Parquet</a>
        is a columnar storage format that supports
        nested data structures. It uses the record shredding and
        assembly algorithm described in the Dremel paper.</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_8" data-toggle="tab">Java</a></li>
        <li><a href="#scala_8" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_8" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_8">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
    smile> val df = read.parquet("data/kylo/userdata1.parquet")
    df: DataFrame =
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_8">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
    smile> var df = Read.parquet("data/kylo/userdata1.parquet")
    df ==>
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_8">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
    >>> val df = read.parquet("data/kylo/userdata1.parquet")
    >>> df
    res100: smile.data.DataFrame =
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    |2016-02-03T03:52:53|  9|      Jose|   Foster|   jfoster8@yelp.com|  Male|  132.31.53.61|                |         South Korea| 3/27/1992|231067.84|Software Test Eng...|   1E+02|
    |2016-02-03T18:29:47| 10|     Emily|  Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671|             Nigeria| 1/28/1997| 27234.28|     Health Coach IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    </code></pre>
    </div>
    </div>
    </div>

    <h3 id="read.avro">Apache Avro</h3>
    <p><a href="https://avro.apache.org/">Apache Avro</a>
        is a data serialization system.
        Avro provides rich data structures, a compact and fast binary data format,
        a container file for storing persistent data, and remote procedure call (RPC).
        Avro relies on schemas. When Avro data is stored in a file, its schema
        is stored with it. Avro schemas are defined with JSON.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_9" data-toggle="tab">Java</a></li>
        <li><a href="#scala_9" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_9" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_9">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
    smile> val df = read.avro(Paths.getTestData("kylo/userdata1.avro"), Paths.getTestData("avro/userdata.avsc"))
    df: DataFrame =
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_9">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
    smile> var avrodf = Read.avro(smile.util.Paths.getTestData("kylo/userdata1.avro"), smile.util.Paths.getTestData("kylo/userdata.avsc"))
    avrodf ==>
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_9">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
    >>> val avrodf = read.avro(smile.util.Paths.getTestData("kylo/userdata1.avro"), smile.util.Paths.getTestData("kylo/userdata.avsc"))
    >>> avrodf
    res104: smile.data.DataFrame =
    +--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |   registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29Z|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03Z|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|            null|              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31Z|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T12:36:21Z|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31Z|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34Z|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08Z|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06Z|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|            null|Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    |2016-02-03T03:52:53Z|  9|      Jose|   Foster|   jfoster8@yelp.com|  Male|  132.31.53.61|            null|         South Korea| 3/27/1992|231067.84|Software Test Eng...|   1E+02|
    |2016-02-03T18:29:47Z| 10|     Emily|  Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671|             Nigeria| 1/28/1997| 27234.28|     Health Coach IV|        |
    +--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    </code></pre>
    </div>
    </div>
    </div>

    <h3 id="read.arrow">Apache Arrow</h3>
    <p><a href="https://arrow.apache.org/">Apache Arrow</a>
        is a cross-language development platform for in-memory data.
        It specifies a standardized language-independent columnar memory format
        for flat and hierarchical data, organized for efficient analytic
        operations on modern hardware.</p>

    <p>Feather uses the Apache Arrow columnar memory specification to represent binary
        data on disk. This makes read and write operations very fast. This is particularly
        important for encoding null/NA values and variable-length types like UTF8 strings.
        Feather is a part of the broader Apache Arrow project. Feather defines its own
        simplified schemas and metadata for on-disk representation.</p>

    <p>In the example below, we write a DataFrame to a Feather
        file and then read it back.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_10" data-toggle="tab">Java</a></li>
        <li><a href="#scala_10" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_10" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_10">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
    smile> val temp = java.io.File.createTempFile("chinook", "arrow")
    temp: java.io.File = /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5413820941564790310arrow

    smile> val path = temp.toPath()
    path: java.nio.file.Path = /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5413820941564790310arrow

    smile> write.arrow(df, path)
    [main] INFO smile.io.Arrow - write 1000 rows

    smile> val df = read.arrow(path)
    df: DataFrame =
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_10">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
    smile> var temp = java.io.File.createTempFile("chinook", "arrow")
    temp ==> /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5430879887643149276arrow

    smile> var path = temp.toPath()
    path ==> /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5430879887643149276arrow

    smile> Write.arrow(df, path)
    [main] INFO smile.io.Arrow - write 1000 rows

    smile> var arrowdf = Read.arrow(path)
    [main] INFO smile.io.Arrow - read 1000 rows and 13 columns
    arrowdf ==>
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_10">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
    >>> val temp = java.io.File.createTempFile("chinook", "arrow")
    >>> val path = temp.toPath()
    >>> write.arrow(df, path)
    [main] INFO smile.io.Arrow - write 1000 rows
    >>> val df = read.arrow(path)
    [main] INFO smile.io.Arrow - read 1000 rows and 13 columns
    >>> df
    res109: smile.data.DataFrame =
    +-----------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-----------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |             null|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |             null|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |             null|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |             null|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |             null|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |             null|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |             null|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |             null|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    |             null|  9|      Jose|   Foster|   jfoster8@yelp.com|  Male|  132.31.53.61|                |         South Korea| 3/27/1992|231067.84|Software Test Eng...|   1E+02|
    |             null| 10|     Emily|  Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671|             Nigeria| 1/28/1997| 27234.28|     Health Coach IV|        |
    +-----------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    </code></pre>
    </div>
    </div>
    </div>

    <h3 id="read.sas">SAS7BDAT</h3>
    <p>SAS7BDAT is currently the main format
        used for storing SAS datasets across all platforms.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_11" data-toggle="tab">Java</a></li>
        <li><a href="#scala_11" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_11" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_11">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> val df = read.sas(Paths.getTestData("sas/airline.sas7bdat"))
    df: DataFrame =
    +----+-----+-----+------+-----+-----+
    |YEAR|    Y|    W|     R|    L|    K|
    +----+-----+-----+------+-----+-----+
    |1948|1.214|0.243|0.1454|1.415|0.612|
    |1949|1.354| 0.26|0.2181|1.384|0.559|
    |1950|1.569|0.278|0.3157|1.388|0.573|
    |1951|1.948|0.297| 0.394| 1.55|0.564|
    |1952|2.265| 0.31|0.3559|1.802|0.574|
    |1953|2.731|0.322|0.3593|1.926|0.711|
    |1954|3.025|0.335|0.4025|1.964|0.776|
    |1955|3.562| 0.35|0.3961|2.116|0.827|
    |1956|3.979|0.361|0.3822|2.435|  0.8|
    |1957| 4.42|0.379|0.3045|2.707|0.921|
    +----+-----+-----+------+-----+-----+
    22 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_11">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> var sasdf = Read.sas("data/sas/airline.sas7bdat")
    sasdf ==>
    +----+-----+-----+------+-----+-----+
    |YEAR|    Y|    W|     R|    L|    K|
    +----+-----+-----+------+-----+-----+
    |1948|1.214|0.243|0.1454|1.415|0.612|
    |1949|1.354| 0.26|0.2181|1.384|0.559|
    |1950|1.569|0.278|0.3157|1.388|0.573|
    |1951|1.948|0.297| 0.394| 1.55|0.564|
    |1952|2.265| 0.31|0.3559|1.802|0.574|
    |1953|2.731|0.322|0.3593|1.926|0.711|
    |1954|3.025|0.335|0.4025|1.964|0.776|
    |1955|3.562| 0.35|0.3961|2.116|0.827|
    |1956|3.979|0.361|0.3822|2.435|  0.8|
    |1957| 4.42|0.379|0.3045|2.707|0.921|
    +----+-----+-----+------+-----+-----+
    22 more rows...
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_11">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> val df = read.sas("data/sas/airline.sas7bdat")
    >>> df
    res112: smile.data.DataFrame =
    +----+-----+-----+------+-----+-----+
    |YEAR|    Y|    W|     R|    L|    K|
    +----+-----+-----+------+-----+-----+
    |1948|1.214|0.243|0.1454|1.415|0.612|
    |1949|1.354| 0.26|0.2181|1.384|0.559|
    |1950|1.569|0.278|0.3157|1.388|0.573|
    |1951|1.948|0.297| 0.394| 1.55|0.564|
    |1952|2.265| 0.31|0.3559|1.802|0.574|
    |1953|2.731|0.322|0.3593|1.926|0.711|
    |1954|3.025|0.335|0.4025|1.964|0.776|
    |1955|3.562| 0.35|0.3961|2.116|0.827|
    |1956|3.979|0.361|0.3822|2.435|  0.8|
    |1957| 4.42|0.379|0.3045|2.707|0.921|
    +----+-----+-----+------+-----+-----+
    22 more rows...
    </code></pre>
    </div>
    </div>
    </div>

    <h3 id="read.jdbc">Relational Database</h3>
    <p>It is also easy to load data from relational databases through JDBC.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_12" data-toggle="tab">Java</a></li>
        <li><a href="#scala_12" data-toggle="tab">Scala</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_12">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
    smile> import $ivy.`org.xerial:sqlite-jdbc:3.28.0`
    import $ivy.$

    smile> Class.forName("org.sqlite.JDBC")
    res23: Class[?0] = class org.sqlite.JDBC

    smile> val url = String.format("jdbc:sqlite:%s", Paths.getTestData("sqlite/chinook.db").toAbsolutePath())
    url: String = "jdbc:sqlite:data/sqlite/chinook.db"
    smile> val sql = """select e.firstname as 'Employee First', e.lastname as 'Employee Last', c.firstname as 'Customer First', c.lastname as 'Customer Last', c.country, i.total
                     from employees as e
                     join customers as c on e.employeeid = c.supportrepid
                     join invoices as i on c.customerid = i.customerid
                    """
    sql: String = """select e.firstname as 'Employee First', e.lastname as 'Employee Last', c.firstname as 'Customer First', c.lastname as 'Customer Last', c.country, i.total
                     from employees as e
                     join customers as c on e.employeeid = c.supportrepid
                     join invoices as i on c.customerid = i.customerid
                    """

    smile> val conn = java.sql.DriverManager.getConnection(url)
    conn: java.sql.Connection = org.sqlite.jdbc4.JDBC4Connection@782cd00

    smile> val stmt = conn.createStatement()
    stmt: java.sql.Statement = org.sqlite.jdbc4.JDBC4Statement@40df1311

    smile> val rs = stmt.executeQuery(sql)
    rs: java.sql.ResultSet = org.sqlite.jdbc4.JDBC4ResultSet@5a524a19

    smile> val df = DataFrame.of(rs)
    df: DataFrame =
    +--------------+-------------+--------------+-------------+-------+-----+
    |Employee First|Employee Last|Customer First|Customer Last|Country|Total|
    +--------------+-------------+--------------+-------------+-------+-----+
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.96|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 5.94|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 0.99|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 1.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil|13.86|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 8.91|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 1.98|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany|13.86|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 8.91|
    +--------------+-------------+--------------+-------------+-------+-----+
    402 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_12">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
    smile> Class.forName("org.sqlite.JDBC")
    $1 ==> class org.sqlite.JDBC

    smile> var url = String.format("jdbc:sqlite:%s", smile.util.Paths.getTestData("sqlite/chinook.db").toAbsolutePath())
    url ==> "jdbc:sqlite:/Users/hli/github/smile/shell/target ... ../data/sqlite/chinook.db"

    smile> var sql = """
              select e.firstname as 'Employee First', e.lastname as 'Employee Last', c.firstname as 'Customer First', c.lastname as 'Customer Last', c.country, i.total
              from employees as e
              join customers as c on e.employeeid = c.supportrepid
              join invoices as i on c.customerid = i.customerid"""
    sql ==> "select e.firstname as 'Employee First', e.lastna ... ustomerid = i.customerid"

    smile> var conn = java.sql.DriverManager.getConnection(url)
    conn ==> org.sqlite.jdbc4.JDBC4Connection@1df82230

    smile> var stmt = conn.createStatement()
    stmt ==> org.sqlite.jdbc4.JDBC4Statement@75329a49

    smile> var rs = stmt.executeQuery(sql)
    rs ==> org.sqlite.jdbc4.JDBC4ResultSet@48aaecc3

    smile> var sqldf = DataFrame.of(rs)
    sqldf ==>
    +--------------+-------------+--------------+-------------+-------+-----+
    |Employee First|Employee Last|Customer First|Customer Last|Country|Total|
    +--------------+-------------+--------------+-------------+-------+-----+
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.96|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 5.94|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 0.99|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 1.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil|13.86|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 8.91|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 1.98|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany|13.86|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 8.91|
    +--------------+-------------+--------------+-------------+-------+-----+
    402 more rows...
          </code></pre>
            </div>
        </div>
    </div>
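
    <p>The shell sessions above leave the connection, statement and result set open, which is
        fine for interactive exploration. In application code we would normally close them.
        The following is a minimal sketch that uses only the standard <code>java.sql</code> API.
        The names <code>JdbcSketch</code>, <code>ResultHandler</code> and <code>query</code> are
        hypothetical and are not part of Smile.</p>

    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    /** Illustrative helper (hypothetical name), not part of Smile. */
    class JdbcSketch {
        /** Callback that consumes the result set while it is still open. */
        interface ResultHandler {
            void handle(ResultSet rs) throws SQLException;
        }

        /**
         * Runs a query and hands the result set to the handler.
         * try-with-resources closes the result set, the statement and the
         * connection (in that order) when the block exits, even on failure.
         */
        static void query(String url, String sql, ResultHandler handler) throws SQLException {
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                handler.handle(rs);
            }
        }
    }
    </code></pre>
    </div>

    <p>Assuming the SQLite driver is on the classpath, a call such as
        <code>JdbcSketch.query(url, sql, rs -> System.out.println(DataFrame.of(rs)))</code>
        would build the data frame inside the handler, while the result set is still open.</p>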

    <h3 id="read.arff">Weka ARFF</h3>
    <p><a href="https://www.cs.waikato.ac.nz/ml/weka/arff.html">Weka ARFF (attribute relation file format)</a>
        is an ASCII text file format that is essentially a CSV file with a header that describes the metadata.
        ARFF was developed for use in the <a href="https://www.cs.waikato.ac.nz/ml/weka/">Weka</a> machine learning software.</p>

    <p>A dataset is first described, beginning with the name of the dataset (or the relation in ARFF terminology).
        Each of the variables (or attributes in ARFF terminology) used to describe the observations is then identified,
        together with its data type, each definition on a single line. The actual observations are then listed,
        each on a single line, with fields separated by commas, much like a CSV file.</p>

    <p>Missing values in an ARFF dataset are identified using the question mark '?'.
        Comments can be included in the file, introduced at the beginning of a line with a '%',
        whereby the remainder of the line is ignored.</p>

    <p>A significant advantage of the ARFF data file over the CSV data file is the metadata information.
        Also, the ability to include comments ensures we can record extra information about the data set,
        including how it was derived, where it came from, and how it might be cited.</p>
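
    <p>To make the layout concrete, the following sketch walks through a tiny ARFF document and
        counts its attributes, observations and missing values. It is only an illustration of the
        format: the sample data and the <code>ArffSketch</code> class are made up for this example,
        quoted values and sparse rows are ignored, and this is not how Smile's reader is implemented.</p>

    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    /** Illustrative only: summarizes the layout of a small ARFF document. */
    class ArffSketch {
        /** Hypothetical sample, not one of the files in data/weka. */
        static final String SAMPLE = String.join("\n",
            "% Toy weather data; this comment line is ignored",
            "@relation weather",
            "@attribute outlook {sunny, overcast, rainy}",
            "@attribute temperature numeric",
            "@attribute play {yes, no}",
            "@data",
            "sunny,85,no",
            "overcast,?,yes",
            "rainy,70,yes");

        /** Returns the counts of attributes, observations and missing values, in that order. */
        static int[] summarize(String arff) {
            int attributes = 0;
            int observations = 0;
            int missing = 0;
            boolean inData = false;
            for (String raw : arff.split("\n")) {
                String line = raw.trim();
                // Skip blank lines and comments introduced by '%'.
                if (line.isEmpty() || line.startsWith("%")) continue;
                // ARFF keywords are case-insensitive.
                String lower = line.toLowerCase();
                if (inData) {
                    // After @data, every line is one comma-separated observation.
                    observations++;
                    for (String field : line.split(",")) {
                        if (field.trim().equals("?")) missing++;
                    }
                } else if (lower.startsWith("@attribute")) {
                    attributes++;
                } else if (lower.startsWith("@data")) {
                    inData = true;
                }
            }
            return new int[] {attributes, observations, missing};
        }
    }
    </code></pre>
    </div>

    <p>For the sample above, <code>ArffSketch.summarize(ArffSketch.SAMPLE)</code> returns
        <code>{3, 3, 1}</code>: three attributes, three observations and one missing value.</p>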

    <p>The directory <code>data/weka</code> contains
        many sample ARFF files. We can also read data from remote servers
        over HTTP, FTP, etc.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_13" data-toggle="tab">Java</a></li>
        <li><a href="#scala_13" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_13" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_13">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> val df = read.arff("https://github.com/haifengl/smile/blob/master/shell/src/universal/data/weka/cpu.arff?raw=true")
    [main] INFO smile.io.Arff - Read ARFF relation cpu
    df: DataFrame =
    +----+-----+-----+----+-----+-----+-----+
    |MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX|class|
    +----+-----+-----+----+-----+-----+-----+
    | 125|  256| 6000| 256|   16|  128|  199|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|16000|  32|    8|   16|  132|
    |  26| 8000|32000|  64|    8|   32|  290|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|64000|  64|   16|   32|  749|
    |  23|32000|64000| 128|   32|   64| 1238|
    +----+-----+-----+----+-----+-----+-----+
    199 more rows...
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_13">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> var cpu = Read.arff("https://github.com/haifengl/smile/blob/master/shell/src/universal/data/weka/cpu.arff?raw=true")
    [main] INFO smile.io.Arff - Read ARFF relation cpu
    cpu ==>
    +----+-----+-----+----+-----+-----+-----+
    |MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX|class|
    +----+-----+-----+----+-----+-----+-----+
    | 125|  256| 6000| 256|   16|  128|  199|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|16000|  32|    8|   16|  132|
    |  26| 8000|32000|  64|    8|   32|  290|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|64000|  64|   16|   32|  749|
    |  23|32000|64000| 128|   32|   64| 1238|
    +----+-----+-----+----+-----+-----+-----+
    199 more rows...
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_13">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> val df = read.arff("https://github.com/haifengl/smile/blob/master/shell/src/universal/data/weka/cpu.arff?raw=true")
    [main] INFO smile.io.Arff - Read ARFF relation cpu
    >>> df
    res114: smile.data.DataFrame =
    +----+-----+-----+----+-----+-----+-----+
    |MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX|class|
    +----+-----+-----+----+-----+-----+-----+
    | 125|  256| 6000| 256|   16|  128|  199|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|16000|  32|    8|   16|  132|
    |  26| 8000|32000|  64|    8|   32|  290|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|64000|  64|   16|   32|  749|
    |  23|32000|64000| 128|   32|   64| 1238|
    +----+-----+-----+----+-----+-----+-----+
    199 more rows...
    </code></pre>
    </div>
    </div>
    </div>

    <h3 id="read.csv">Delimited Text and CSV</h3>
    <p>Delimited text files are widely used in the machine learning research community.
        The comma-separated values (CSV) file is a special case. Smile provides a flexible
        parser for them based on the
        <a href="https://commons.apache.org/proper/commons-csv/">Apache Commons CSV</a> library.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_14" data-toggle="tab">Java</a></li>
        <li><a href="#scala_14" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_14" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_14">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    def csv(file: String, delimiter: Char = ',', header: Boolean = true, quote: Char = '"', escape: Char = '\\', schema: StructType = null): DataFrame
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_14">
            <p>In the Java API, the user may provide a
                <a href="https://javadoc.io/doc/org.apache.commons/commons-csv/latest/index.html">CSVFormat</a>
                argument to specify the format of a CSV file.</p>
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    public interface Read {
        /** Reads a CSV file. */
        static DataFrame csv(String path) throws IOException, URISyntaxException

        /** Reads a CSV file. */
        static DataFrame csv(String path, CSVFormat format) throws IOException, URISyntaxException

        /** Reads a CSV file. */
        static DataFrame csv(String path, CSVFormat format, StructType schema) throws IOException, URISyntaxException

        /** Reads a CSV file. */
        static DataFrame csv(Path path) throws IOException

        /** Reads a CSV file. */
        static DataFrame csv(Path path, CSVFormat format) throws IOException

        /** Reads a CSV file. */
        static DataFrame csv(Path path, CSVFormat format, StructType schema) throws IOException
    }
          </code></pre>
            </div>
        </div>
        <div class="tab-pane" id="kotlin_14">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    fun csv(file: String, delimiter: Char = ',', header: Boolean = true, quote: Char = '"', escape: Char = '\\', schema: StructType? = null): DataFrame
    </code></pre>
            </div>
        </div>
    </div>

    <p>The parser tries its best to infer the schema of the data from the top rows.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_15" data-toggle="tab">Java</a></li>
        <li><a href="#scala_15" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_15" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_15">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    val zip = read.csv("data/usps/zip.train", delimiter = ' ', header = false)
                </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_15">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    import org.apache.commons.csv.CSVFormat
    var format = CSVFormat.DEFAULT.withDelimiter(' ')
    var zip = Read.csv("data/usps/zip.train", format)
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_15">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    val zip = read.csv("data/usps/zip.train", delimiter = ' ', header = false)
    </code></pre>
    </div>
    </div>
    </div>
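
    <p>Inferring a schema from the top rows can be pictured as follows. This is a simplified
        sketch, not Smile's actual inference logic: the class and method names are hypothetical,
        only three types are distinguished, and all rows are assumed to have the same number of
        fields. Each column starts with the type of its first value and is widened
        (int to double to string) whenever a later sampled value does not fit.</p>

    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    /** Illustrative only: guesses column types from a few sampled rows. */
    class SchemaInferenceSketch {
        /** Returns "int", "double" or "string" for a single field. */
        static String typeOf(String field) {
            String s = field.trim();
            try {
                Integer.parseInt(s);
                return "int";
            } catch (NumberFormatException e) {
                // not an int, try the next wider type
            }
            try {
                Double.parseDouble(s);
                return "double";
            } catch (NumberFormatException e) {
                // not a number at all
            }
            return "string";
        }

        /** Returns the narrowest type that can hold both a and b. */
        static String widen(String a, String b) {
            if (a.equals(b)) return a;
            if (a.equals("string") || b.equals("string")) return "string";
            return "double"; // one is int, the other is double
        }

        /** Infers one type per column from the sampled rows (assumed rectangular). */
        static String[] infer(String[][] topRows) {
            if (topRows.length == 0) return new String[0];
            String[] types = new String[topRows[0].length];
            for (String[] row : topRows) {
                for (int i = 0; i != row.length; i++) {
                    String t = typeOf(row[i]);
                    types[i] = (types[i] == null) ? t : widen(types[i], t);
                }
            }
            return types;
        }
    }
    </code></pre>
    </div>

    <p>Because only the sampled rows are inspected, a column that looks like <code>int</code>
        in the top rows may still contain a value such as <code>N/A</code> further down the file.</p>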

    <p>If the parser fails to infer the schema, the user may provide a predefined schema.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_16" data-toggle="tab">Java</a></li>
        <li><a href="#scala_16" data-toggle="tab">Scala</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_16">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
smile> val airport = new NominalScale("ABE", "ABI", "ABQ", "ABY", "ACK", "ACT",
         "ACV", "ACY", "ADK", "ADQ", "AEX", "AGS", "AKN", "ALB", "ALO", "AMA", "ANC",
         "APF", "ASE", "ATL", "ATW", "AUS", "AVL", "AVP", "AZO", "BDL", "BET", "BFL",
         "BGM", "BGR", "BHM", "BIL", "BIS", "BJI", "BLI", "BMI", "BNA", "BOI", "BOS",
         "BPT", "BQK", "BQN", "BRO", "BRW", "BTM", "BTR", "BTV", "BUF", "BUR", "BWI",
         "BZN", "CAE", "CAK", "CDC", "CDV", "CEC", "CHA", "CHO", "CHS", "CIC", "CID",
         "CKB", "CLD", "CLE", "CLL", "CLT", "CMH", "CMI", "CMX", "COD", "COS", "CPR",
         "CRP", "CRW", "CSG", "CVG", "CWA", "CYS", "DAB", "DAL", "DAY", "DBQ", "DCA",
         "DEN", "DFW", "DHN", "DLG", "DLH", "DRO", "DSM", "DTW", "EAU", "EGE", "EKO",
         "ELM", "ELP", "ERI", "EUG", "EVV", "EWN", "EWR", "EYW", "FAI", "FAR", "FAT",
         "FAY", "FCA", "FLG", "FLL", "FLO", "FMN", "FNT", "FSD", "FSM", "FWA", "GEG",
         "GFK", "GGG", "GJT", "GNV", "GPT", "GRB", "GRK", "GRR", "GSO", "GSP", "GST",
         "GTF", "GTR", "GUC", "HDN", "HHH", "HKY", "HLN", "HNL", "HOU", "HPN", "HRL",
         "HSV", "HTS", "HVN", "IAD", "IAH", "ICT", "IDA", "ILG", "ILM", "IND", "INL",
         "IPL", "ISO", "ISP", "ITO", "IYK", "JAC", "JAN", "JAX", "JFK", "JNU", "KOA",
         "KTN", "LAN", "LAR", "LAS", "LAW", "LAX", "LBB", "LBF", "LCH", "LEX", "LFT",
         "LGA", "LGB", "LIH", "LIT", "LNK", "LRD", "LSE", "LWB", "LWS", "LYH", "MAF",
         "MBS", "MCI", "MCN", "MCO", "MDT", "MDW", "MEI", "MEM", "MFE", "MFR", "MGM",
         "MHT", "MIA", "MKE", "MLB", "MLI", "MLU", "MOB", "MOD", "MOT", "MQT", "MRY",
         "MSN", "MSO", "MSP", "MSY", "MTH", "MTJ", "MYR", "OAJ", "OAK", "OGD", "OGG",
         "OKC", "OMA", "OME", "ONT", "ORD", "ORF", "OTZ", "OXR", "PBI", "PDX", "PFN",
         "PHF", "PHL", "PHX", "PIA", "PIE", "PIH", "PIT", "PLN", "PMD", "PNS", "PSC",
         "PSE", "PSG", "PSP", "PUB", "PVD", "PVU", "PWM", "RAP", "RCA", "RDD", "RDM",
         "RDU", "RFD", "RHI", "RIC", "RNO", "ROA", "ROC", "ROW", "RST", "RSW", "SAN",
         "SAT", "SAV", "SBA", "SBN", "SBP", "SCC", "SCE", "SDF", "SEA", "SFO", "SGF",
         "SGU", "SHV", "SIT", "SJC", "SJT", "SJU", "SLC", "SLE", "SMF", "SMX", "SNA",
         "SOP", "SPI", "SPS", "SRQ", "STL", "STT", "STX", "SUN", "SUX", "SWF", "SYR",
         "TEX", "TLH", "TOL", "TPA", "TRI", "TTN", "TUL", "TUP", "TUS", "TVC", "TWF",
         "TXK", "TYR", "TYS", "VCT", "VIS", "VLD", "VPS", "WRG", "WYS", "XNA", "YAK",
         "YKM", "YUM")
airport: NominalScale = nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM]

smile> val schema = new StructType(
         new StructField("Month", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
           "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12")),
         new StructField("DayofMonth", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
           "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12", "c-13", "c-14", "c-15", "c-16", "c-17", "c-18",
           "c-19", "c-20", "c-21", "c-22", "c-23", "c-24", "c-25", "c-26", "c-27", "c-28", "c-29", "c-30", "c-31")),
         new StructField("DayOfWeek", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
           "c-5", "c-6", "c-7")),
         new StructField("DepTime", DataTypes.IntType),
         new StructField("UniqueCarrier", DataTypes.ByteType, new NominalScale("9E", "AA", "AQ", "AS",
           "B6", "CO", "DH", "DL", "EV", "F9", "FL", "HA", "HP", "MQ", "NW", "OH", "OO", "TZ", "UA", "US", "WN", "XE", "YV")),
         new StructField("Origin", DataTypes.ShortType, airport),
         new StructField("Dest", DataTypes.ShortType, airport),
         new StructField("Distance", DataTypes.IntType),
         new StructField("dep_delayed_15min", DataTypes.ByteType, new NominalScale("N", "Y"))
       )
schema: StructType = [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12], DayofMonth: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12, c-13, c-14, c-15, c-16, c-17, c-18, c-19, c-20, c-21, c-22, c-23, c-24, c-25, c-26, c-27, c-28, c-29, c-30, c-31], DayOfWeek: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7], DepTime: int, UniqueCarrier: byte nominal[9E, AA, AQ, AS, B6, CO, DH, DL, EV, F9, FL, HA, HP, MQ, NW, OH, OO, TZ, UA, US, WN, XE, YV], Origin: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, 
TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Dest: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Distance: int, dep_delayed_15min: byte nominal[N, Y]]

smile> val airline = read.csv("shell/src/universal/data/airline/train-1m.csv", schema = schema)
airline: DataFrame = [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12], DayofMonth: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12, c-13, c-14, c-15, c-16, c-17, c-18, c-19, c-20, c-21, c-22, c-23, c-24, c-25, c-26, c-27, c-28, c-29, c-30, c-31], DayOfWeek: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7], DepTime: int, UniqueCarrier: byte nominal[9E, AA, AQ, AS, B6, CO, DH, DL, EV, F9, FL, HA, HP, MQ, NW, OH, OO, TZ, UA, US, WN, XE, YV], Origin: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, 
TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Dest: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Distance: int, dep_delayed_15min: byte nominal[N, Y]]
+-----+----------+---------+-------+-------------+------+----+--------+-----------------+
|Month|DayofMonth|DayOfWeek|DepTime|UniqueCarrier|Origin|Dest|Distance|dep_delayed_15min|
...
                </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_16">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> import smile.data.type.*
       import smile.data.measure.*
       var airport = new NominalScale("ABE", "ABI", "ABQ", "ABY", "ACK", "ACT",
          "ACV", "ACY", "ADK", "ADQ", "AEX", "AGS", "AKN", "ALB", "ALO", "AMA", "ANC",
          "APF", "ASE", "ATL", "ATW", "AUS", "AVL", "AVP", "AZO", "BDL", "BET", "BFL",
          "BGM", "BGR", "BHM", "BIL", "BIS", "BJI", "BLI", "BMI", "BNA", "BOI", "BOS",
          "BPT", "BQK", "BQN", "BRO", "BRW", "BTM", "BTR", "BTV", "BUF", "BUR", "BWI",
          "BZN", "CAE", "CAK", "CDC", "CDV", "CEC", "CHA", "CHO", "CHS", "CIC", "CID",
          "CKB", "CLD", "CLE", "CLL", "CLT", "CMH", "CMI", "CMX", "COD", "COS", "CPR",
          "CRP", "CRW", "CSG", "CVG", "CWA", "CYS", "DAB", "DAL", "DAY", "DBQ", "DCA",
          "DEN", "DFW", "DHN", "DLG", "DLH", "DRO", "DSM", "DTW", "EAU", "EGE", "EKO",
          "ELM", "ELP", "ERI", "EUG", "EVV", "EWN", "EWR", "EYW", "FAI", "FAR", "FAT",
          "FAY", "FCA", "FLG", "FLL", "FLO", "FMN", "FNT", "FSD", "FSM", "FWA", "GEG",
          "GFK", "GGG", "GJT", "GNV", "GPT", "GRB", "GRK", "GRR", "GSO", "GSP", "GST",
          "GTF", "GTR", "GUC", "HDN", "HHH", "HKY", "HLN", "HNL", "HOU", "HPN", "HRL",
          "HSV", "HTS", "HVN", "IAD", "IAH", "ICT", "IDA", "ILG", "ILM", "IND", "INL",
          "IPL", "ISO", "ISP", "ITO", "IYK", "JAC", "JAN", "JAX", "JFK", "JNU", "KOA",
          "KTN", "LAN", "LAR", "LAS", "LAW", "LAX", "LBB", "LBF", "LCH", "LEX", "LFT",
          "LGA", "LGB", "LIH", "LIT", "LNK", "LRD", "LSE", "LWB", "LWS", "LYH", "MAF",
          "MBS", "MCI", "MCN", "MCO", "MDT", "MDW", "MEI", "MEM", "MFE", "MFR", "MGM",
          "MHT", "MIA", "MKE", "MLB", "MLI", "MLU", "MOB", "MOD", "MOT", "MQT", "MRY",
          "MSN", "MSO", "MSP", "MSY", "MTH", "MTJ", "MYR", "OAJ", "OAK", "OGD", "OGG",
          "OKC", "OMA", "OME", "ONT", "ORD", "ORF", "OTZ", "OXR", "PBI", "PDX", "PFN",
          "PHF", "PHL", "PHX", "PIA", "PIE", "PIH", "PIT", "PLN", "PMD", "PNS", "PSC",
          "PSE", "PSG", "PSP", "PUB", "PVD", "PVU", "PWM", "RAP", "RCA", "RDD", "RDM",
          "RDU", "RFD", "RHI", "RIC", "RNO", "ROA", "ROC", "ROW", "RST", "RSW", "SAN",
          "SAT", "SAV", "SBA", "SBN", "SBP", "SCC", "SCE", "SDF", "SEA", "SFO", "SGF",
          "SGU", "SHV", "SIT", "SJC", "SJT", "SJU", "SLC", "SLE", "SMF", "SMX", "SNA",
          "SOP", "SPI", "SPS", "SRQ", "STL", "STT", "STX", "SUN", "SUX", "SWF", "SYR",
          "TEX", "TLH", "TOL", "TPA", "TRI", "TTN", "TUL", "TUP", "TUS", "TVC", "TWF",
          "TXK", "TYR", "TYS", "VCT", "VIS", "VLD", "VPS", "WRG", "WYS", "XNA", "YAK",
          "YKM", "YUM")
airport ==> nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, A ... , WYS, XNA, YAK, YKM, YUM]

smile> var schema = new StructType(
          new StructField("Month", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
            "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12")),
          new StructField("DayofMonth", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
            "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12", "c-13", "c-14", "c-15", "c-16", "c-17", "c-18",
            "c-19", "c-20", "c-21", "c-22", "c-23", "c-24", "c-25", "c-26", "c-27", "c-28", "c-29", "c-30", "c-31")),
          new StructField("DayOfWeek", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
            "c-5", "c-6", "c-7")),
          new StructField("DepTime", DataTypes.IntType),
          new StructField("UniqueCarrier", DataTypes.ByteType, new NominalScale("9E", "AA", "AQ", "AS",
            "B6", "CO", "DH", "DL", "EV", "F9", "FL", "HA", "HP", "MQ", "NW", "OH", "OO", "TZ", "UA", "US", "WN", "XE", "YV")),
          new StructField("Origin", DataTypes.ShortType, airport),
          new StructField("Dest", DataTypes.ShortType, airport),
          new StructField("Distance", DataTypes.IntType),
          new StructField("dep_delayed_15min", DataTypes.ByteType, new NominalScale("N", "Y"))
       )
schema ==> [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6 ... 15min: byte nominal[N, Y]]

smile> var format = CSVFormat.DEFAULT.withFirstRecordAsHeader();
format ==> Delimiter=<,> QuoteChar=<"> RecordSeparator=<
>  ... eaderRecord:true Header:[]

smile> var airline = Read.csv("data/airline/train-1m.csv", format, schema);
airline ==> [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6 ... ----+
999990 more rows...
          </code></pre>
            </div>
        </div>
    </div>

    <h3 id="read.libsvm">LibSVM</h3>
    <p>LibSVM is a very fast and popular library for support vector machines.
        LibSVM uses a sparse format where zero values do not need to be stored.
        Each line of a libsvm file is in the format:</p>
    <pre><code>
    &lt;label&gt; &lt;index1&gt;:&lt;value1&gt; &lt;index2&gt;:&lt;value2&gt; ...
    </code></pre>
    <p>where &lt;label&gt; is the target value of the training data.
        For classification, it should be an integer that identifies a class
        (multi-class classification is supported). For regression, it can be any real
        number. For one-class SVM, the label is not used, so it can be any number.
        Each &lt;index&gt; is an integer starting from 1, and each &lt;value&gt;
        is a real number. The indices must be in ascending order. The labels in
        the testing data file are only used to calculate accuracy or error. If they
        are unknown, just fill this column with any number.</p>
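
    <p>To make the line format concrete, here is a minimal sketch of a parser for a single
        libsvm line, written with the JDK only. The class <code>LibSvmLine</code> and its methods
        are hypothetical names introduced for illustration; this is not how Smile implements
        its reader. The sketch converts the 1-based file indices to 0-based array indices and
        rejects indices that are not in ascending order.</p>

```java
// Illustrative only: LibSvmLine is a hypothetical helper, not part of Smile.
final class LibSvmLine {
    final double label;
    final int[] index;     // zero-based feature indices
    final double[] value;  // values of the stored features

    LibSvmLine(double label, int[] index, double[] value) {
        this.label = label;
        this.index = index;
        this.value = value;
    }

    /** Parses one line such as "1 1:0.5 3:2.0". */
    static LibSvmLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int[] index = new int[tokens.length - 1];
        double[] value = new double[tokens.length - 1];
        int previous = 0; // last 1-based index seen
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Malformed token: " + tokens[i]);
            }
            int oneBased = Integer.parseInt(pair[0]);
            if (oneBased <= previous) {
                throw new IllegalArgumentException("Indices must start at 1 and ascend: " + tokens[i]);
            }
            previous = oneBased;
            index[i - 1] = oneBased - 1; // file is 1-based, arrays are 0-based
            value[i - 1] = Double.parseDouble(pair[1]);
        }
        return new LibSvmLine(label, index, value);
    }

    /** Expands the sparse entries into a dense vector of length p. */
    double[] toDense(int p) {
        double[] x = new double[p];
        for (int i = 0; i < index.length; i++) {
            x[index[i]] = value[i];
        }
        return x;
    }
}
```

    <p>For example, <code>LibSvmLine.parse("1 1:0.5 3:2.0").toDense(4)</code> fills positions 0 and 2
        and leaves the remaining entries at zero.</p>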

    <p>To read a libsvm file, use the function <code>Read.libsvm</code> in the <code>smile.io</code>
        package (<code>read.libsvm</code> in the Scala and Kotlin shells).</p>

    <p>Although libsvm employs a sparse format, most libsvm files contain dense data.
        Therefore, Smile also provides helper functions to convert
        it to dense arrays.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_17" data-toggle="tab">Java</a></li>
        <li><a href="#scala_17" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_17" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_17">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> val glass = read.libsvm("data/libsvm/glass.txt")
    glass: Dataset[Instance[SparseArray]] = smile.data.DatasetImpl@5611bba
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_17">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> var glass = Read.libsvm("data/libsvm/glass.txt")
    glass ==> smile.data.DatasetImpl@524f3b3a
          </code></pre>
            </div>
        </div>
        <div class="tab-pane" id="kotlin_17">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> read.libsvm("data/libsvm/glass.txt")
    res118: smile.data.Dataset&lt;smile.data.SampleInstance&lt;smile.util.SparseArray&gt;&gt; = smile.data.DatasetImpl@50d667c3
    </code></pre>
            </div>
        </div>
    </div>

    <p>For truly sparse libsvm data, we can convert the dataset to a <code>SparseMatrix</code>
        for more efficient matrix computation.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_18" data-toggle="tab">Java</a></li>
        <li><a href="#scala_18" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_18" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_18">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> SparseDataset.of(glass).toMatrix
    res2: SparseMatrix = smile.math.matrix.SparseMatrix@290807e5
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_18">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> var glass = Read.libsvm("data/libsvm/glass.txt")
    glass ==> smile.data.DatasetImpl@17baae6e

    smile> SparseDataset.of(glass).toMatrix()
    $4 ==> smile.math.matrix.SparseMatrix@6b53e23f
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_18">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> SparseDataset.of(glass).toMatrix()
    res120: smile.math.matrix.SparseMatrix! = smile.math.matrix.SparseMatrix@45db84b0
    </code></pre>
    </div>
    </div>
    </div>

    <p>Note that <code>read.libsvm</code> returns a <code>Dataset[Instance[SparseArray]]</code>
        object. The <code>Instance</code> class holds both the sample object and its label. To convert the
        sample set to a sparse matrix, we first convert the <code>Dataset</code> object to a
        <code>SparseDataset</code>, which doesn't have the label. We discuss the details of
        <code>SparseDataset</code> in the next section.</p>

    <h3 id="sparse-format">Coordinate Triple Tuple List</h3>

    <p>The function <code>SparseDataset.from(Path path, int arrayIndexOrigin)</code>
        can read sparse data in coordinate triple tuple list format. The parameter
        <code>arrayIndexOrigin</code> is the starting index of the array. By default, it is
        0 as in C/C++ and Java, but it can be 1 to parse data produced
        by other programming languages such as Fortran.</p>

    <p>The coordinate file stores a list of (row, column, value) tuples:</p>
    <pre>
    instanceID attributeID value
    instanceID attributeID value
    instanceID attributeID value
    instanceID attributeID value
    ...
    instanceID attributeID value
    instanceID attributeID value
    instanceID attributeID value
    </pre>

    <p>Ideally, the entries are sorted (by row index, then column index) to improve
        random access times. This format is good for incremental matrix
        construction.</p>

    <p>Optionally, the file may begin with 2 header lines</p>

    <pre>
    D    // The number of instances
    W    // The number of attributes
    </pre>

    <p>or 3 header lines</p>

    <pre>
    D    // The number of instances
    W    // The number of attributes
    N    // The total number of nonzero items in the dataset.
    </pre>
    <p>These header lines will be ignored.</p>
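
    <p>The following sketch shows one way the coordinate format could be parsed with the JDK
        alone. <code>CoordinateReader</code> and <code>Entry</code> are hypothetical names, and the
        rule used to skip header lines (any line that does not have exactly three fields) is an
        assumption made for this example, not a description of Smile's parser. The
        <code>arrayIndexOrigin</code> argument is subtracted from the row and column indices so
        that the result is always zero-based.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: CoordinateReader is a hypothetical helper, not Smile's implementation.
final class CoordinateReader {
    /** One (row, column, value) entry with zero-based indices. */
    static final class Entry {
        final int row;
        final int column;
        final double value;

        Entry(int row, int column, double value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    /**
     * Parses coordinate lines. Lines that do not have exactly three
     * whitespace-separated fields (such as the optional header lines)
     * are skipped; this skipping rule is an assumption of the sketch.
     */
    static List<Entry> parse(List<String> lines, int arrayIndexOrigin) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue; // header or blank line
            }
            int row = Integer.parseInt(fields[0]) - arrayIndexOrigin;
            int column = Integer.parseInt(fields[1]) - arrayIndexOrigin;
            entries.add(new Entry(row, column, Double.parseDouble(fields[2])));
        }
        return entries;
    }
}
```

    <p>With <code>arrayIndexOrigin = 1</code>, the line <code>3 4 5.0</code> becomes the entry at
        row 2, column 3.</p>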

    <p>The sample data <code>data/sparse/kos.txt</code> is in the coordinate format.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_19" data-toggle="tab">Java</a></li>
        <li><a href="#scala_19" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_19" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_19">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> val kos = SparseDataset.from(java.nio.file.Paths.get("data/sparse/kos.txt"), 1)
    kos: SparseDataset = smile.data.SparseDatasetImpl@4da602fc
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_19">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> var kos = SparseDataset.from(java.nio.file.Paths.get("data/sparse/kos.txt"), 1)
    kos ==> smile.data.SparseDatasetImpl@4d826d77
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_19">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> SparseDataset.from(java.nio.file.Paths.get("data/sparse/kos.txt"), 1)
    res123: smile.data.SparseDataset! = smile.data.SparseDatasetImpl@485b4fd0
    </code></pre>
    </div>
    </div>
    </div>

    <h3 id="Harwell-Boeing">Harwell-Boeing Column-Compressed Sparse Matrix</h3>

    <p>In a Harwell-Boeing column-compressed sparse matrix file, nonzero values are stored in an array
        (top-to-bottom, then left-to-right). The row indices corresponding to
        the values are also stored. In addition, a list of pointers gives the index at which
        each column starts. The class <code>SparseMatrix</code> supports two formats for Harwell-Boeing files.
        The simple one is organized as follows:</p>

    <p>The first line contains three integers, which are the number of rows,
        the number of columns, and the number of nonzero entries in the matrix.</p>

    <p>Following the first line, there are m + 1 integers that are the column
        pointers, where m is the number of columns. Then there are n integers that
        are the row indices of nonzero entries, where n is the number of nonzero
        entries. Finally, there are n floating-point numbers that are the values of nonzero
        entries.</p>
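
    <p>As an illustration of this layout, the sketch below reads the three sections from a
        string and looks up a single entry by scanning one column. The class
        <code>ColumnCompressed</code> is a hypothetical name, and the sketch assumes
        whitespace-separated tokens and zero-based indices; it is not Smile's
        <code>SparseMatrix</code> reader.</p>

```java
import java.util.Scanner;

// Illustrative only: a hypothetical reader for the simple layout described above,
// assuming whitespace-separated tokens and zero-based indices.
final class ColumnCompressed {
    final int nrow;
    final int ncol;
    final int[] colIndex;   // ncol + 1 pointers into rowIndex and values
    final int[] rowIndex;   // row of each nonzero entry
    final double[] values;  // nonzero values, column by column

    ColumnCompressed(String text) {
        Scanner in = new Scanner(text);
        nrow = in.nextInt();
        ncol = in.nextInt();
        int nnz = in.nextInt();
        colIndex = new int[ncol + 1];
        for (int j = 0; j <= ncol; j++) {
            colIndex[j] = in.nextInt();
        }
        rowIndex = new int[nnz];
        for (int k = 0; k < nnz; k++) {
            rowIndex[k] = in.nextInt();
        }
        values = new double[nnz];
        for (int k = 0; k < nnz; k++) {
            values[k] = Double.parseDouble(in.next());
        }
    }

    /** Returns the entry at (i, j), or 0 if it is not stored. */
    double get(int i, int j) {
        // Entries of column j occupy positions colIndex[j] .. colIndex[j + 1] - 1.
        for (int k = colIndex[j]; k < colIndex[j + 1]; k++) {
            if (rowIndex[k] == i) {
                return values[k];
            }
        }
        return 0.0;
    }
}
```

    <p>The extra pointer at position m marks the end of the last column, which is why the
        file stores m + 1 pointers rather than m.</p>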

    <p>The function <code>SparseMatrix.text(Path path)</code> can read this simple
        format. In the directory <code>data/matrix</code>, there are several sample files in
        the Harwell-Boeing format.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_20" data-toggle="tab">Java</a></li>
        <li><a href="#scala_20" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_20" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_20">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> val blocks = SparseMatrix.text(java.nio.file.Paths.get("data/matrix/08blocks.txt"))
    blocks: SparseMatrix = smile.math.matrix.SparseMatrix@4263b080
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_20">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code>
    smile> import smile.math.matrix.*;

    smile> var blocks = SparseMatrix.text(java.nio.file.Paths.get("data/matrix/08blocks.txt"))
    blocks ==> smile.math.matrix.SparseMatrix@7ff95560
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_20">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code>
    >>> import smile.math.matrix.*
    >>> SparseMatrix.text(java.nio.file.Paths.get("data/matrix/08blocks.txt"))
    res126: smile.math.matrix.SparseMatrix! = smile.math.matrix.SparseMatrix@1a479168
    </code></pre>
    </div>
    </div>
    </div>

    <p>The second format, called the Harwell-Boeing Exchange Format, is more complicated and powerful.
        For details, see <a href="https://people.sc.fsu.edu/~jburkardt/data/hb/hb.html">https://people.sc.fsu.edu/~jburkardt/data/hb/hb.html</a>.
        Note that our implementation supports only real-valued matrices, and it ignores
        the optional right-hand-side vectors. This format is supported by the function
        <code>SparseMatrix.harwell(Path path)</code>.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_21" data-toggle="tab">Java</a></li>
        <li><a href="#scala_21" data-toggle="tab">Scala</a></li>
        <li><a href="#kotlin_21" data-toggle="tab">Kotlin</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_21">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
smile> val five = SparseMatrix.harwell(java.nio.file.Paths.get("data/matrix/5by5_rua.hb"))
[main] INFO smile.math.matrix.SparseMatrix - Reads sparse matrix file '/Users/hli/github/smile/shell/target/universal/stage/data/matrix/5by5_rua.hb'
[main] INFO smile.math.matrix.SparseMatrix - Title                                                                   Key
[main] INFO smile.math.matrix.SparseMatrix - 5             1             1             3             0
[main] INFO smile.math.matrix.SparseMatrix - RUA                        5             5            13             0
[main] INFO smile.math.matrix.SparseMatrix - (6I3)           (13I3)          (5E15.8)            (5E15.8)
five: SparseMatrix = smile.math.matrix.SparseMatrix@1761de10
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_21">
            <div class="code" style="text-align: left;">
          <pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var five = SparseMatrix.harwell(java.nio.file.Paths.get("data/matrix/5by5_rua.hb"))
[main] INFO smile.math.matrix.SparseMatrix - Reads sparse matrix file '/Users/hli/github/smile/shell/target/universal/stage/data/matrix/5by5_rua.hb'
[main] INFO smile.math.matrix.SparseMatrix - Title                                                                   Key
[main] INFO smile.math.matrix.SparseMatrix - 5             1             1             3             0
[main] INFO smile.math.matrix.SparseMatrix - RUA                        5             5            13             0
[main] INFO smile.math.matrix.SparseMatrix - (6I3)           (13I3)          (5E15.8)            (5E15.8)
five ==> smile.math.matrix.SparseMatrix@6b4a4e18
          </code></pre>
            </div>
        </div>
            <div class="tab-pane" id="kotlin_21">
    <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
>>> SparseMatrix.harwell(java.nio.file.Paths.get("data/matrix/5by5_rua.hb"))
[main] INFO smile.math.matrix.SparseMatrix - Reads sparse matrix file '/Users/hli/github/smile/shell/target/universal/stage/data/matrix/5by5_rua.hb'
[main] INFO smile.math.matrix.SparseMatrix - Title                                                                   Key
[main] INFO smile.math.matrix.SparseMatrix - 5             1             1             3             0
[main] INFO smile.math.matrix.SparseMatrix - RUA                        5             5            13             0
[main] INFO smile.math.matrix.SparseMatrix - (6I3)           (13I3)          (5E15.8)            (5E15.8)
res127: smile.math.matrix.SparseMatrix! = smile.math.matrix.SparseMatrix@37672764
    </code></pre>
    </div>
    </div>
    </div>

    <h3 id="wireframe">Wireframe</h3>
    <p>Smile can parse 3D wireframe models in Wavefront OBJ files.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#scala_23" data-toggle="tab">Scala</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="scala_23">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    def read.wavefront(file: String): (Array[Array[Double]], Array[Array[Int]])
    </code></pre>
            </div>
        </div>
    </div>

    <p>In the directory <code>data/wavefront</code>, there is a teapot wireframe model. In the
        next section, we will learn how to visualize 3D wireframe models.</p>

    <ul class="nav nav-tabs">
        <li class="active"><a href="#scala_24" data-toggle="tab">Scala</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane active" id="scala_24">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    smile> val (vertices, edges) = read.wavefront("data/wavefront/teapot.obj")
    vertices: Array[Array[Double]] = Array(
      Array(40.6266, 28.3457, -1.10804),
      Array(40.0714, 30.4443, -1.10804),
      Array(40.7155, 31.1438, -1.10804),
      Array(42.0257, 30.4443, -1.10804),
      Array(43.4692, 28.3457, -1.10804),
      Array(37.5425, 28.3457, 14.5117),
      Array(37.0303, 30.4443, 14.2938),
      Array(37.6244, 31.1438, 14.5466),
      Array(38.8331, 30.4443, 15.0609),
      Array(40.1647, 28.3457, 15.6274),
      Array(29.0859, 28.3457, 27.1468),
      Array(28.6917, 30.4443, 26.7527),
      Array(29.149, 31.1438, 27.2099),
      Array(30.0792, 30.4443, 28.1402),
      Array(31.1041, 28.3457, 29.165),
      Array(16.4508, 28.3457, 35.6034),
      Array(16.2329, 30.4443, 35.0912),
      Array(16.4857, 31.1438, 35.6853),
      Array(16.9999, 30.4443, 36.894),
      Array(17.5665, 28.3457, 38.2256),
      Array(0.831025, 28.3457, 38.6876),
      Array(0.831025, 30.4443, 38.1324),
      Array(0.831025, 31.1438, 38.7764),
      Array(0.831025, 30.4443, 40.0866),
    ...
    edges: Array[Array[Int]] = Array(
      Array(6, 5),
      Array(5, 0),
      Array(6, 0),
      Array(0, 1),
      Array(1, 6),
      Array(0, 6),
      Array(7, 6),
      Array(6, 1),
      Array(7, 1),
      Array(1, 2),
      Array(2, 7),
      Array(1, 7),
      Array(8, 7),
      Array(7, 2),
      Array(8, 2),
      Array(2, 3),
      Array(3, 8),
      Array(2, 8),
      Array(9, 8),
      Array(8, 3),
      Array(9, 3),
      Array(3, 4),
      Array(4, 9),
      Array(3, 9),
    ...
    </code></pre>
            </div>
        </div>
    </div>

    <h2 id="export">Export Data and Models</h2>

    <p>To serialize a model, you may use</p>
    <ul class="nav nav-tabs">
        <li class="active"><a href="#java_25" data-toggle="tab">Java</a></li>
        <li><a href="#scala_25" data-toggle="tab">Scala</a></li>
    </ul>
    <div class="tab-content">
        <div class="tab-pane" id="scala_25">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-scala"><code>
    import smile._
    write(model, file)
    </code></pre>
            </div>
        </div>
        <div class="tab-pane active" id="java_25">
            <div class="code" style="text-align: left;">
    <pre class="prettyprint lang-java"><code>
    import smile.io.Write;
    Write.object(model, file)
    </code></pre>
            </div>
        </div>
    </div>

    <p>This method serializes the model in Java serialization format. This is handy
        if you want to use a model in Spark.</p>
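
    <p>As a rough illustration of what Java serialization involves, the sketch below
        round-trips a small object through a file using only the JDK.
        <code>ToyModel</code> and <code>SerializationSketch</code> are hypothetical names
        introduced here, and the code is not Smile's implementation of
        <code>Write.object</code>. Any class saved this way must implement
        <code>java.io.Serializable</code>.</p>

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: ToyModel is a hypothetical stand-in for a trained model.
final class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] weights;

    ToyModel(double[] weights) {
        this.weights = weights;
    }

    /** Returns the dot product of the weights and x. */
    double predict(double[] x) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }
}

// Illustrative only: hypothetical helpers built on the JDK object streams.
final class SerializationSketch {
    /** Writes the object to the file in Java serialization format. */
    static void save(Serializable model, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back an object previously written by save. */
    static Object load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    <p>The process that loads the file needs the same class on its classpath, which is worth
        keeping in mind when a model is shipped to another JVM.</p>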

    <p>You can also save a <code>DataFrame</code> to an ARFF file with the method
        <code>write.arff(data, file)</code>. The ARFF file keeps the data type information.
        If you prefer a plain CSV text file, you may use the methods <code>write.csv(data, file)</code> or
        <code>write.table(data, file, "delimiter")</code>, which save a generic two-dimensional array
        with a comma or a custom delimiter. To save a one-dimensional array, simply call
        <code>write(array, file)</code>.</p>

    <div id="btnv">
        <span class="btn-arrow-left">&larr; &nbsp;</span>
        <a class="btn-prev-text" href="overview.html" title="Previous Section: What's Machine Learning"><span>What's Machine Learning</span></a>
        <a class="btn-next-text" href="visualization.html" title="Next Section: Visualization"><span>Visualization</span></a>
        <span class="btn-arrow-right">&nbsp;&rarr;</span>
    </div>
</div>

<script type="text/javascript">
    $('#toc').toc({exclude: 'h1, h5, h6', context: '', autoId: true, numerate: false});
</script>
